Introduction

The United Kingdom’s decision to withdraw from the European Union, commonly known as Brexit, stands as one of the most consequential political developments in recent history. The discourse surrounding Brexit was formally initiated by Prime Minister David Cameron’s speech on January 23, 2013, in which he proposed an in-out referendum on the UK’s membership in the EU (Cameron, 2013). This debate reached its climax with the referendum held on June 23, 2016, resulting in a vote in favor of leaving the EU (BBC News, 2016). Analyzing the sentiments and rhetoric employed in parliamentary debates during this period offers valuable insights into the political and public attitudes that shaped the outcome of the Brexit vote (Smith, 2017).

Motivation

Sentiment analysis and natural language processing (NLP) are essential tools for analyzing large volumes of text, such as parliamentary debates. These methods can help identify the emotions and rhetorical strategies used by legislators, which can influence public opinion. Research by Osnabrügge et al. (2021) has shown that emotive rhetoric is particularly pronounced in high-profile legislative debates, such as Prime Minister’s Questions, and is used by legislators to appeal directly to voters. By applying sentiment analysis to the debates in the House of Commons, we can measure the sentiment towards Brexit and understand the emotional tone of the discussions.

Methodology

Installing libraries

library(conText)
library(dplyr)
library(readr)
library(tidyr)
library(lubridate)
library(openxlsx)
library(text2vec)
library(data.table)
library(stringr)
library(tm)  
library(readxl)
library(ggplot2)
library(ggimage)
library(patchwork)
library(magick)
library(patchwork)
library(cowplot)
library(plotly)
library(reticulate)
library(tidytext)
library(widyr)
library(tibble)
library(gt)
library(broom)
library(ggtext)
library(sf)

Here you should set your own working directory.

# Read CSV and specify column types
setwd("~/Desktop/Master/TFM/TFM_final")

Pre-processing ParlVote dataset

Creating a date variable and renaming variables

The dataset automatically omitted non- speech elements included in the transcripts such as ‘[laugh- ter]’ and ‘rose—’ (they have been deleted and cleaned already for this dataset) - Talk about how the authors have pre-processed the data.

First we will extract and create a date column:

# Read the csv file
ParlVote_concat <- read_csv("ParlVote_concat.csv")

ParlVote_concat <- ParlVote_concat |>
  # Convert the first 10 characters of debate_id to a Date object
  mutate(date = as.Date(substr(debate_id, 1, 10)),
         # Convert motion_speaker_id to numeric
         motion_speaker_id = as.numeric(motion_speaker_id),
         # Convert speaker_id to numeric
         speaker_id = as.numeric(speaker_id)) |>
  # Reorder columns, placing the date column first
  select(date, everything())

Fixing MP’s data:

We need to have a standard way of identifying MP’s names and surnames and also add information on the constituency they are governing in each period. That is in a single variable having the name and surname separated by a comma. Also we need to have clear column names for performing later data analysis. So let´s get the data right for the MPs. For that we will define a function that:

  • Unites the names and surnames of the MPs

  • Renames some columns

# Define unite function |> 
apply_unite <- function(df) {
  df |>
    unite("Full Name", "Last name", "First name", sep = ", ", remove = FALSE)
}

# Function to process each dataset separately
processing <- function(df) {
  df |> 
    #previously defined function to unite names and surnames
    apply_unite() |> 
    #rename some columns
    mutate(speaker_party = tolower(Party)) |> 
    rename(speaker_id = `Person ID`, speaker_name = `Full Name`, 
           speaker_constituency = Constituency) |> 
    select(speaker_id, speaker_name, speaker_constituency)
}

Now we can import the .csv files with the information on the MP’s, apply the function we just created for standarising their names and uniting them in a single dataframe:

# Read all datasets and apply unite function using lapply
mps_1997 <- processing(read_csv("mps (1997).csv"))#1 May 1997
mps_2001 <- processing(read_csv("mps (2001).csv"))#7 Jun 2001
mps_2005 <- processing(read_csv("mps (2005).csv"))#5 May 2005
mps_2010 <- processing(read_csv("mps (2010).csv"))#6 May 2010
mps_2015 <- processing(read_csv("mps (2015).csv"))#7 May 2015
mps_2017 <- processing(read_csv("mps (2017).csv"))#8 June 2017
mps_2019 <- processing(read_csv("mps (2019).csv"))#for posterior changes

all_mps <- bind_rows(mps_1997, mps_2001, mps_2005, mps_2010, mps_2015, mps_2017, mps_2019)

all_mps <- distinct(all_mps)

Adding constituencies to each speaker:

I found that some MP’s were duplicated yet the observation was unique as in each observation they had a different constituency name assigned.

First we will check duplicated values in the speaker_id column to find MP’s duplicated in name but under a different constituency as constituencies change every once in a while.

# Check all the speaker names with more than one ocurrence
all_mps |> 
    group_by(speaker_id) |> 
    filter(n() > 1) |> 
    ungroup()
## # A tibble: 212 × 3
##    speaker_id speaker_name    speaker_constituency                 
##         <dbl> <chr>           <chr>                                
##  1      10029 Bayley, Hugh    City of York                         
##  2      10045 Betts, Clive    Sheffield, Attercliffe               
##  3      10048 Blears, Hazel   Salford                              
##  4      10050 Blunkett, David Sheffield, Brightside                
##  5      10068 Brown, Gordon   Dunfermline East                     
##  6      10069 Brown, Nick     Newcastle upon Tyne East and Wallsend
##  7      10070 Brown, Russell  Dumfries                             
##  8      10075 Buck, Karen     Regent's Park and Kensington North   
##  9      10079 Burns, Simon    West Chelmsford                      
## 10      10116 Clarke, Tom     Coatbridge and Chryston              
## # ℹ 202 more rows

Due to a change in the distribution of constituencies there are 97 MPs duplicated. We can fix this issue if we merge the constituencies by name and by date in case a MP has been elected in different years in different constituencies.

Boundaries changed at some point between the 2005 elections to the 2010 elections. Thus, for this analysis we will keep the later boundaries/district names (those of 2010).

# Create a final list of unique values (MP's and their constituencies) by keeping in case there is a duplicated observation the latest one (2010)
unique_speakers <- all_mps |> 
  arrange(speaker_id) |> 
  group_by(speaker_id) |> 
  filter(row_number() == n())  # Keep only the last occurrence of each `speaker_id`

#check if I only have unique values: 
anyDuplicated(unique_speakers)
## [1] 0

Finally we can have a clean list of all the MPs which we can merge to the original dataframe (ParlVote)

ParlVote_concat <- ParlVote_concat |>
  # Join with the unique_speakers data frame based on speaker_id and speaker_name
  left_join(unique_speakers, by = c("speaker_id", "speaker_name")) |>
  # Move the speaker_constituency column to be right after speaker_name
  relocate(speaker_constituency, .after = speaker_name)

To make sure the merge was succesful we can check NAs:

# Assuming ParlVote_concat is your data frame
na_indices <- is.na(ParlVote_concat$speaker_constituency)

# Locate rows with NAs in `speaker_constituency`
na_rows <- ParlVote_concat[na_indices, ]

#check the names: 
unique(na_rows$speaker_name)
##  [1] "Devine, Jim"             "Austin-Walker, John"    
##  [3] "Mason, John"             "Barclay, Stephen"       
##  [5] "Jamieson, Cathy"         "McKenzie, Iain"         
##  [7] "Sawford, Andy"           "Thornton, Mike"         
##  [9] "Dhesi, Tanmanjeet Singh" "Little Pengelly, Emma"  
## [11] "Dodds, Jane"

There are 11 MPs not identified, let´s fix our data with information from the House of Commons page “Members of Parliament after 1832” (n.d) and create a function to automatically match and fill in the missing data:

# Define a list (dictionary) mapping (unmatched) speaker names to their constituencies
speaker_to_constituency <- list(
    "Devine, Jim" = "Livingston",
    "Austin-Walker, John" = "Erith and Thamesmead",
    "Mason, John" = "Glasgow Shettleston",
    "Barclay, Stephen" = "North East Cambridgeshire",
    "Jamieson, Cathy" = "Kilmarnock and Loudoun",
    "McKenzie, Iain" = "Inverclyde",
    "Sawford, Andy" = "Corby",
    "Thornton, Mike" = "Eastleigh",
    "Dhesi, Tanmanjeet Singh" = "Slough",
    "Little Pengelly, Emma" = "Belfast South",
    "Dodds, Jane" = "Brecon and Radnorshire"
    # Add more speaker names and their constituencies as needed
)

# Fill in the missing values in `speaker_constituency` in `unique_speakers`
for (i in seq_len(nrow(ParlVote_concat))) {
    if (is.na(ParlVote_concat$speaker_constituency[i])) {
        # Get the speaker's name
        speaker_name <- ParlVote_concat$speaker_name[i]
        
        # Check if the speaker's name is in the reference list
        if (speaker_name %in% names(speaker_to_constituency)) {
            # Fill in the missing constituency
            ParlVote_concat$speaker_constituency[i] <- speaker_to_constituency[[speaker_name]]
        }
    }
}

Perform another NA’s check:

sum(is.na(ParlVote_concat$speaker_constituency))
## [1] 0

And now we will seek to do the same for motion speakers (motion_speaker column) to standarise both columns the same way:

unique_speakers <- unique_speakers |> 
  rename( motion_speaker_id = speaker_id, motion_speaker_name = speaker_name, motion_speaker_constituency = speaker_constituency)
ParlVote_concat <- ParlVote_concat |> 
  left_join(unique_speakers, by = c("motion_speaker_id", "motion_speaker_name")) |> 
  relocate(motion_speaker_constituency, .after = motion_speaker_name)
# Assuming ParlVote_concat is your data frame
na_indices <- is.na(ParlVote_concat$motion_speaker_constituency)

# Locate rows with NAs in `speaker_constituency`
na_rows <- ParlVote_concat[na_indices, ]

#check the names: 
unique(na_rows$motion_speaker_name)
## [1] "Jamieson, Cathy"  "Sawford, Andy"    "Barclay, Stephen"

Let´s fix the 3 missing ones (MP’s):

# Define a list (dictionary) mapping speaker names to their constituencies
speaker_to_constituency <- list(
    "Barclay, Stephen" = "North East Cambridgeshire",
    "Jamieson, Cathy" = "Kilmarnock and Loudoun",
    "Sawford, Andy" = "Corby"
)

# Fill in the missing values in `speaker_constituency` in `unique_speakers`
for (i in seq_len(nrow(ParlVote_concat))) {
    if (is.na(ParlVote_concat$motion_speaker_constituency[i])) {
        # Get the speaker's name
        motion_speaker_name <- ParlVote_concat$motion_speaker_name[i]
        
        # Check if the speaker's name is in the reference list
        if (motion_speaker_name %in% names(speaker_to_constituency)) {
            # Fill in the missing constituency
            ParlVote_concat$motion_speaker_constituency[i] <- speaker_to_constituency[[motion_speaker_name]]
        }
    }
}

Finally: check NAs, filter dates and drop useless variables:

sum(is.na(ParlVote_concat$motion_speaker_constituency))
## [1] 0
ParlVote_concat <- ParlVote_concat |> 
  #Drop useless variables
  select(-debate_id, -motion_speaker_name) |> 
  #Only interested in debates after 2013
  filter(date >= as.Date("2013-01-01"))

Save the dataset with MP’s information:

write.csv(ParlVote_concat, "parliament_speeches.csv", row.names = FALSE)
write.xlsx(ParlVote_concat, "ParlVote_full.xlsx")

Assigning topics to each debate

First all of the debate titles the dataset has will be explored:

topics <- ParlVote_concat|> 
  distinct(debate_title, motion_text)

nrow(topics)
## [1] 494

There are 494 debates.

Cleaning the topics with Regular Expressions:

# Case-insensitive removal of "Orders of the Day — " from debate_title
topics$debate_title <- gsub("Orders of the Day — ", "", topics$debate_title, ignore.case = TRUE)

# Remove "Clause + number + —" patterns using regex
topics$debate_title <- str_replace_all(topics$debate_title, "Clause\\s*[0-9]+\\s*—\\s*", "")

# Remove occurrences of " — " from debate_title
topics$debate_title <- str_replace_all(topics$debate_title, " — ", "")

# Remove empty square brackets "[]"
topics$debate_title <- gsub("\\[\\]", "", topics$debate_title)

# Remove "Schedule + number + —" patterns using regex
topics$debate_title <- str_replace_all(topics$debate_title, "Schedule\\s*[0-9]+\\s*—\\s*", "")

# Case-insensitive removal of "Unallotted Half-Day"
topics$debate_title <- gsub("Unallotted Half-Day", "", topics$debate_title, ignore.case = TRUE)

# Remove patterns like "[1st Allotted Day]" using regex
topics$debate_title <- gsub("\\[\\d{1,2}(?:st|nd|rd|th) Allotted Day\\]", " ", topics$debate_title)

# Case-insensitive removal of "Un-allotted Half Day"
topics$debate_title <- gsub("Un-allotted Half Day", "", topics$debate_title, ignore.case = TRUE)

Save the cleaned topics dataframe as a .csv to be able to start off the analysis from this point as well.

#Save the cleaned topics df as a csv
write.csv(topics, "topics.csv", row.names = FALSE)

Once we have the topics cleaned we will create a function with predefined topics to which the different speeches will be assigned if they contain any of the keywords described for each topic. This function assigns topics in a specific order, meaning that the text is analysed and if it does not contain any of the words of the first topic the function proceeds to check if it contains any words of the second topic and so on until a topic is assigned. Thus we can´t be neither too vague nor too specific with the keywords:

# Function to categorize each debate title based on keywords
categorize_debate <- function(title, motion) {
  title <- tolower(title)
  motion <- tolower(motion)
  
  # Define category keywords
  eu_keywords <- c('eu', 'european union', 'brexit', 'referendum', 'single market', 'remain', 'leave')
  education_keywords <- c('education', 'school', 'university', 'exam', 'student', 'teacher', 'childcare', 'research')
  energy_keywords <- c('energy', 'environment', 'climate', 'fuel', 'renewable', 'green', 'decarbonisation', 'methanol', 'turbine', 'wind')
  welfare_keywords <- c('social security', 'welfare', 'benefit', 'pension', 'support', 'public service', 'community', 'local', 'council', 'faith', 'constituencies', 'marriage', 'partnership', 'administration', 'child','rehabilitation', 'hmrc', 'colleges', 'domestic' )
  transport_keywords <- c('transport', 'rail', 'road', 'bus', 'train', 'whiplash')
  health_keywords <- c('health', 'nhs', 'hospital', 'doctor', 'patient', 'medical', 'cannabis', 'dying', 'mesothelioma', 'drugs', 'abortion')
  economy_keywords <- c('economy', 'finance', 'budget', 'economic', 'market', 'technology', 'innovation', 'tech', 'digital', 'internet', 'growth', 'infrastructure', 'anti-tax', 'tax', 'vat', 'financial', 'income', 'revenue', 'donations', 'expenditure', 'intellectual', 'taxation', 'bank', 'trade', 'consumer', 'ivory', 'customs', 'leverage', 'allowance', 'tfl')
  housing_keywords <- c('housing', 'home', 'rent', 'landlord', 'tenant', 'dwelling', 'house', 'first-time')
  legal_crime_keywords <- c('crime', 'unlawful', 'justice', 'police', 'criminal', 'court', 'equality', 'rights', 'defamation', 'dog', 'exemplary', 'injuries')
  employment_keywords <- c('employment', 'job', 'work', 'worker', 'labor', 'enrolment', 'employee-ownership', 'self-employed', 'employment', 'employees', 'internships')
  foreign_keywords <- c('foreign', 'international', 'diplomacy', 'relations', 'defense', 'military', 'army', 'navy', 'security', 'immigration', 'gibraltar', 'bulk', 'interception', 'safety', 'overseas', 'picketing', 'park', 'nca', 'slavery', 'entering')
  

  # Check for category keywords in title and motion and assign the topic
  if (any(sapply(eu_keywords, grepl, title)) || any(sapply(eu_keywords, grepl, motion))) {
    return('EU/Brexit')
  } else if (any(sapply(education_keywords, grepl, title)) || any(sapply(education_keywords, grepl, motion))) {
    return('Education')
  } else if (any(sapply(energy_keywords, grepl, title)) || any(sapply(energy_keywords, grepl, motion))) {
    return('Energy/Environment')
  } else if (any(sapply(employment_keywords, grepl, title)) || any(sapply(employment_keywords, grepl, motion))) {
    return('Employment')
  } else if (any(sapply(welfare_keywords, grepl, title)) || any(sapply(welfare_keywords, grepl, motion))) {
    return('Social Security/Society/Welfare')
  } else if (any(sapply(transport_keywords, grepl, title)) || any(sapply(transport_keywords, grepl, motion))) {
    return('Transport')
  } else if (any(sapply(health_keywords, grepl, title)) || any(sapply(health_keywords, grepl, motion))) {
    return('Health')
  } else if (any(sapply(economy_keywords, grepl, title)) || any(sapply(economy_keywords, grepl, motion))) {
    return('Economy/Technology')
  } else if (any(sapply(housing_keywords, grepl, title)) || any(sapply(housing_keywords, grepl, motion))) {
    return('Housing')
  } else if (any(sapply(legal_crime_keywords, grepl, title)) || any(sapply(legal_crime_keywords, grepl, motion))) {
    return('Crime/Law')
  } else if (any(sapply(foreign_keywords, grepl, title)) || any(sapply(foreign_keywords, grepl, motion))) {
    return('Defense/Foreign Affairs')
  } else {
    return('Parliamentary proceedings')
  }
}

# Apply the categorization function to each row
topics$category <- mapply(categorize_debate, topics$debate_title, topics$motion_text)

# Display the distribution of categories
table(topics$category)
## 
##                       Crime/Law         Defense/Foreign Affairs 
##                               7                              11 
##              Economy/Technology                       Education 
##                              57                              34 
##                      Employment              Energy/Environment 
##                              42                              27 
##                       EU/Brexit                          Health 
##                             142                              13 
##                         Housing       Parliamentary proceedings 
##                              56                              41 
## Social Security/Society/Welfare                       Transport 
##                              52                              12
#drop obervations that have no category assigned (I have checked and it´s the one: "Clause 2  —  Meaning of consultant lobbying")
topics <- topics |> 
  filter(!is.na(category))

To now add the categories to our main ParlVote dataset:

  • First, perform the same cleaning on the debate_title variable as it was done before with the topics dataframe

  • Second, merge the dataset by the now cleaned debate_title variable

First:

# Case-insensitive removal of "Orders of the Day — " from debate_title
ParlVote_concat$debate_title <- gsub("Orders of the Day — ", "", ParlVote_concat$debate_title, ignore.case = TRUE)

# Remove "Clause + number + —" patterns using regex
ParlVote_concat$debate_title <- str_replace_all(ParlVote_concat$debate_title, "Clause\\s*[0-9]+\\s*—\\s*", "")

# Remove occurrences of " — " from debate_title
ParlVote_concat$debate_title <- str_replace_all(ParlVote_concat$debate_title, " — ", "")

# Remove empty square brackets "[]"
ParlVote_concat$debate_title <- gsub("\\[\\]", "", ParlVote_concat$debate_title)

# Remove "Schedule + number + —" patterns using regex
ParlVote_concat$debate_title <- str_replace_all(ParlVote_concat$debate_title, "Schedule\\s*[0-9]+\\s*—\\s*", "")

# Case-insensitive removal of "Unallotted Half-Day"
ParlVote_concat$debate_title <- gsub("Unallotted Half-Day", "", ParlVote_concat$debate_title, ignore.case = TRUE)

# Remove patterns like "[1st Allotted Day]" using regex
ParlVote_concat$debate_title <- gsub("\\[\\d{1,2}(?:st|nd|rd|th) Allotted Day\\]", " ", ParlVote_concat$debate_title)

# Case-insensitive removal of "Un-allotted Half Day"
ParlVote_concat$debate_title <- gsub("Un-allotted Half Day", "", ParlVote_concat$debate_title, ignore.case = TRUE)

Second: Merge the topics dataset to the ParlVote and create a merged dataframe that is ready to be used with Python for Natural Language Processing:

topics_unique <- topics |> 
  distinct(debate_title, motion_text, .keep_all = TRUE)

python_df <- ParlVote_concat |> 
  left_join(topics_unique, by = c("debate_title", "motion_text")) |> 
  relocate(category, .after = debate_title)

Carry out a final check of NA’s and save the dataset:

sum(is.na(python_df$category))
## [1] 0
# Write the updated dataframe to a new CSV file
write.csv(python_df, 'python_df.csv', row.names = FALSE)

Speech sentiment analysis and Brexit vote

Research question: Does the sentiment attached to a MP´s speech shape the opinions and vote of the people in its constituency (Brexit case)?

The objective is to perform a sentiment analysis on each of the interventions of each MP of each of the debates covered and to assign a column with a tag (positive/negative/neutral) and a score for such tag.

The regular RoBERTa model was chosen due to different test being performed on the demo which showed that this model classified better such speeches than the RoBERTa specifically trained on political tweets:

| eval = False
# Installing Hugging Face Libraries
!pip install --upgrade pip
!pip install transformers 
!pip install torch 
!pip install pandas 
!pip install tqdm 
!pip install wordcloud
!pip install seaborn
!pip install plotly
!pip install nltk
!pip install datasets
!pip install transformers[torch] accelerate -U

Import libraries and silence warnings

| eval = False
import torch
import os
import pandas as pd
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig, Trainer, TrainingArguments
import numpy as np
from scipy.special import softmax
from tqdm import tqdm  # For progress bar
from sklearn.model_selection import train_test_split
from datasets import Dataset

import warnings
warnings.filterwarnings('ignore')

# Suppress the parallelism warning from tokenizers
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In the following chunk the following steps are taken:

  1. Loading the dataset
  2. Defining a preprocessing function that converts text to lowercase
  3. Loading RoBERTa model, tokenizer and configuration
  4. Defining a function to get the sentiment and score for a given text as well as the same function but for longer texts (>512 tokens)
  5. Tokenizing all texts first to check their lengths before applying the model and the functions
  6. Apply the function to each row in the “speech” column with a timer that shows progress every 100 rows processed
  7. Adding the results to new columns
  8. Saving the resulting dataset
| eval = False
# Step 1: Load your dataset
df = pd.read_csv("/Users/lauramartinez/Desktop/Master/TFM/python_df.csv")

# Step 2: Define the preprocessing function
def preprocess(text):
    text = text.lower()  # Convert to lowercase
    return text

# Step 3: Load the model and tokenizer
MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# Ensure model is using GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Step 4: Define a function to get the sentiment and score for a given text
def get_sentiment_score(text):
    text = preprocess(text)
    # 'return_tensors' sets returned tokenized as PyTorch tensors
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512).to(device)
    # Disable gradient calculation, as we're performing inference (not training)
    with torch.no_grad():
        # Pass the tokenized input through the model to get the output
        output = model(**encoded_input)
    # Extract the output logits (raw prediction scores) from the model output
    scores = output[0][0].cpu().numpy()
    # Apply the softmax function to convert logits to probabilities
    scores = softmax(scores)
    # Get the indices of the scores sorted in descending order
    ranking = np.argsort(scores)[::-1]
    # Map the highest scoring index to the corresponding sentiment label
    sentiment = config.id2label[ranking[0]]
    # Retrieve the highest score, which corresponds to the predicted sentiment
    score = scores[ranking[0]]
    # Return the predicted sentiment label and its corresponding score as a tuple
    return sentiment, float(score)


# Step 4b: Define a function to handle long texts by splitting into chunks
def get_sentiment_score_for_long_text(text):
    text = preprocess(text)
    # Tokenize the preprocessed text into individual tokens
    tokens = tokenizer.tokenize(text)
    # Define the maximum size of each token chunk
    chunk_size = 512
    # Calculate the number of chunks needed to process the entire text
    num_chunks = (len(tokens) + chunk_size - 1) // chunk_size

    # Initialize an array to store cumulative sentiment scores for all chunks
    all_scores = np.zeros(len(config.id2label))

    # Loop through each chunk of tokens
    for i in range(num_chunks):
        # Extract the current chunk of tokens
        chunk_tokens = tokens[i * chunk_size:(i + 1) * chunk_size]
        # Convert the chunk of tokens back into a string
        chunk_text = tokenizer.convert_tokens_to_string(chunk_tokens)
        # Encode the chunked text into tensors suitable for the model
        encoded_input = tokenizer(chunk_text, return_tensors='pt', truncation=True, max_length=chunk_size).to(device)
        # Run the model on the encoded input without computing gradients (faster inference)
        with torch.no_grad():
            output = model(**encoded_input)
      # Extract the output scores from the model and convert to a NumPy array
        scores = output[0][0].cpu().numpy()
        # Apply softmax to convert raw scores into probabilities
        scores = softmax(scores)
        # Accumulate the scores for each chunk
        all_scores += scores

    # Average the scores across all chunks to get final sentiment scores
    all_scores /= num_chunks
    # Get the index of the highest score (most likely sentiment label)
    ranking = np.argsort(all_scores)[::-1]
    # Retrieve the sentiment label corresponding to the highest score
    sentiment = config.id2label[ranking[0]]
    # Get the highest score itself (confidence level for the sentiment)
    score = all_scores[ranking[0]]
    # Return the predicted sentiment label and its confidence score
    return sentiment, float(score)

# Step 5: Tokenize all texts first to check their lengths
token_lengths = df['speech'].apply(lambda x: len(tokenizer.tokenize(preprocess(x))))


# Step 6: Apply the function to each row in the "speech" column
sentiments = []
scores = []

import time
start_time = time.time()

# Iterate over each speech and its corresponding token length from the DataFrame
for idx, (speech, length) in enumerate(zip(df['speech'], token_lengths)):
    # If the speech is longer than 512 tokens, use function for long texts
    if length > 512:
        sentiment, score = get_sentiment_score_for_long_text(speech)
    #If not, use the standard sentiment analysis function
    else:
        sentiment, score = get_sentiment_score(speech)
    # Append the predicted sentiment to the sentiments list
    sentiments.append(sentiment)
    # Append the associated sentiment score to the scores list
    scores.append(score)
    
    # Log progress every 100 rows
    if (idx + 1) % 100 == 0:
        elapsed_time = time.time() - start_time
        print(f"Processed {idx + 1}/{len(df)} rows. Elapsed time: {elapsed_time:.2f} seconds.")


# Step 7: Add the results to new columns
df['sentiment'] = sentiments
df['score'] = scores

# Save the updated DataFrame to a new CSV file
df.to_csv('/Users/lauramartinez/Desktop/Master/TFM/TFM_final/sentiment_python_df.csv', index=False)

# Optional: Display the first few rows to verify
print(df.head())

Sentiment analysis on R

Call the saved dataset in R:

python_df <- read_csv('/Users/lauramartinez/Desktop/Master/TFM/TFM_final/sentiment_python_df.csv')

Add Brexit results (House of Commons Library, 2017) to the ParlVote dataframe:

referendum <- read_excel("eureferendum_constituency.xlsx")
referendum <- referendum |> 
  rename(speaker_constituency = Constituency) |> 
  select(-`ONS ID`) |> 
  mutate(brexit = ifelse(leave_perc > 0.5, 1, 0))

In order to perform sentiment analysis and extract conclusions it is best to select the variables that will be used for the analysis and create a different dataframe:

# Select variables
brexit_df <- python_df |> 
  select(date, category, speaker_constituency, party, speech, sentiment, score)

# Add the referendum data
brexit_df <- brexit_df |> 
  inner_join(referendum, by = "speaker_constituency")

# Ensure the date column is in Date format
brexit_df$date <- as.Date(brexit_df$date)

# Convert sentiment to a factor
brexit_df$sentiment <- factor(brexit_df$sentiment)

Bar plot of sentiment distribution

After the dataframe is finally ready, we might be interested in exploring the sentiment distribution of the speeches:

# Calculate relative frequencies
brexit_df_relative <- brexit_df |> 
  group_by(sentiment) |> 
  summarise(count = n()) |> 
  mutate(relative_count = count / sum(count))

# Plot the relative distribution of sentiment
ggplot(brexit_df_relative, aes(x = sentiment, y = relative_count, fill = sentiment)) + 
  geom_bar(stat = "identity", color = "white") +
  scale_fill_manual(values = c("neutral" = "cadetblue2", "positive" = "#76EE00", "negative" = "red2")) +
  theme_minimal(base_size = 15) + # Increase base font size for readability
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5, size = 18), # Bold title, center, increase size
    axis.title.x = element_text(face = "bold", size = 14), # Bold x-axis label
    axis.title.y = element_text(face = "bold", size = 14), # Bold y-axis label
    axis.text = element_text(size = 12), # Size for axis tick labels
    legend.title = element_blank(), # Remove legend title for a cleaner look
    legend.position = "top", # Position legend at the top
    legend.text = element_text(size = 12) # Size for legend text
  ) +
  labs(
    title = "Relative Distribution of Sentiment",
    x = "Sentiment",
    y = "Relative Frequency"
  )

Sentiment distribution timeline

Also the distribution of sentiment through time was plotted:

# Calculate relative frequencies
relative_sentiment <- brexit_df |> 
  group_by(date, sentiment) |> 
  summarise(count = n()) |> 
  ungroup() |> 
  group_by(date) |> 
  mutate(relative_count = count / sum(count)) |> 
  ungroup()

# Plot the relative counts over time
ggplot(relative_sentiment, aes(x = date, y = relative_count, fill = sentiment)) + 
  geom_area(position = "fill", color = "white", alpha = 0.6) +  # Use geom_area for stacked plot
  scale_fill_manual(values = c("neutral" = "cadetblue2", "positive" = "#76EE00", "negative" = "red2")) +
  theme_minimal(base_size = 15) +  # Increase base font size for readability
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5, size = 18),  # Bold title, center, increase size
    axis.title.x = element_text(face = "bold", size = 14),  # Bold x-axis label
    axis.title.y = element_text(face = "bold", size = 14),  # Bold y-axis label
    axis.text = element_text(size = 12),  # Size for axis tick labels
    legend.title = element_blank(),  # Remove legend title for a cleaner look
    legend.position = "top",  # Position legend at the top
    legend.text = element_text(size = 12)  # Size for legend text
  ) +
  labs(
    title = "Relative Distribution of Sentiment Over Time",
    x = "Date",
    y = "Relative Frequency"
  )

Average sentiment per topic

Once the categories for each speech have been created and their sentiment retrieved, let´s see which are the category average sentiment scores depending on whether the constituency voted Brexit or not. In this case no date filter is used because we are also interested in seeing the sentiment generated after the Brexit around each topic.

# 1. Group by category and brexit, then compute the average sentiment score for each sentiment
category_brexit_summary <- brexit_df |> 
  group_by(category, brexit, sentiment) |> 
  summarise(avg_score = mean(score, na.rm = TRUE), .groups = 'drop') |> 
  ungroup()

Create a Brexit and No-Brexit profoile based on the topics created:

# Filter for the sentiment with the highest avg_score for each category and brexit
top_sentiment_summary <- category_brexit_summary |>
  group_by(category, brexit) |>
  filter(avg_score == max(avg_score)) |>
  ungroup()

And visualize it:

# Define colour palette
sentiment_colors <- c("positive" = "chartreuse2", "neutral" = "cadetblue2", "negative" = "red2")

# Create a bubble chart
ggplot(top_sentiment_summary, aes(x = factor(brexit), y = reorder(category, avg_score), size = avg_score, color = sentiment)) +
  geom_point(alpha = 0.7) +
  scale_size_continuous(range = c(3, 15)) +  # Adjust the range to change bubble sizes
  scale_x_discrete(labels = c("0" = "NO", "1" = "YES")) +  # Rename the X-Axis labels
  scale_color_manual(values = sentiment_colors) +  # Apply the color palette
  theme_minimal() +
  labs(title = "Top Sentiment by Category and Brexit Status",
       x = "Brexit Status",
       y = "Category",
       size = "Average Score",
       color = "Sentiment") +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5),
        axis.title.y = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        strip.text = element_text(size = 12))

Average sentiment per party

Taking the last party in government before Brexit as the party of reference

# Ensure party is a factor
brexit_df$party <- as.factor(brexit_df$party)

# Group by category and party, then compute the average sentiment score for each sentiment
category_party_summary2 <- brexit_df |> 
  filter(format(date, "%Y") == "2016") |> 
  group_by(category, party, sentiment) |> 
  summarise(avg_score = mean(score, na.rm = TRUE), .groups = 'drop') |> 
  ungroup()

# Filter for the sentiment with the highest avg_score for each category and party
party_sentiment_summary <- category_party_summary2 |>
  group_by(category, party) |>
  filter(avg_score == max(avg_score)) |>
  ungroup()

We plot the party average sentiment by topic:

# Create a bubble chart
ggplot(party_sentiment_summary, aes(x = party, y = reorder(category, avg_score), size = avg_score, color = sentiment)) +
  geom_point(alpha = 0.7) +
  scale_size_continuous(range = c(3, 15)) +  # Adjust the range to change bubble sizes
  scale_color_manual(values = sentiment_colors) +  # Apply the color palette
  theme_minimal() +
  labs(title = "Top Sentiment by Category and Party",
       x = "Party",
       y = "Category",
       size = "Average Score",
       color = "Sentiment") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.y = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        strip.text = element_text(size = 12))

Merge datasets on category to include sentiments of Pro and Anti-Brexit profiles as a baseline for y-axis lines:

# Merge datasets on category to include sentiments for y-axis lines
merged_data <- party_sentiment_summary |>
  left_join(top_sentiment_summary, by = "category", suffix = c("_party", "_axis"), relationship = "many-to-many")

Then the two plots for the Pro-Brexit and Anti-Brexit profiles are created:

# Create the plot for brexit = 0
plot_brexit_0 <- ggplot(merged_data |> filter(brexit == 0), 
                        aes(x = party, y = reorder(category, avg_score_party), size = avg_score_party, color = sentiment_party)) +
  geom_point(alpha = 0.7) +
  scale_size_continuous(range = c(3, 15)) +
  scale_color_manual(values = sentiment_colors) +
  theme_minimal() +
  labs(title = "Top Sentiment by Category and Party (NO Brexit)",
       x = "Party",
       y = "Category",
       size = "Average Score",
       color = "Sentiment") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.y = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        strip.text = element_text(size = 12)) +
  geom_hline(aes(yintercept = category, color = sentiment_axis), size = 0.2)

# Create the plot for brexit = 1
plot_brexit_1 <- ggplot(merged_data |> filter(brexit == 1), 
                        aes(x = party, y = reorder(category, avg_score_party), size = avg_score_party, color = sentiment_party)) +
  geom_point(alpha = 0.7) +
  scale_size_continuous(range = c(3, 15)) +
  scale_color_manual(values = sentiment_colors) +
  theme_minimal() +
  labs(title = "Top Sentiment by Category and Party (Brexit)",
       x = "Party",
       y = "Category",
       size = "Average Score",
       color = "Sentiment") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.y = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        strip.text = element_text(size = 12)) +
  geom_hline(aes(yintercept = category, color = sentiment_axis), size = 0.2)

# Print the plots
print(plot_brexit_0)

print(plot_brexit_1)

All results depend on whether I create the average score using previous data or just 2016 data

Main party governing

I need to first find the pictures of the logos:

Now we will see which party was the one most frequently in power in Pro-Brexit and Anti-Brexit areas:

#Use as reference party the one governing in 2016
party_brexit_summary <- brexit_df |> 
  filter(format(date, "%Y") == "2016") |> 
  distinct(speaker_constituency, brexit, party) |>  # Ensure each constituency is counted only once
  group_by(brexit, party) |> 
  summarise(constituency_count = n(), .groups = 'drop') |>  # Count unique constituencies per brexit and party
  mutate(total_constituencies = n_distinct(brexit_df$speaker_constituency),  # Total number of constituencies
         relative_freq = constituency_count / total_constituencies) |>  # Calculate relative frequency based on total constituencies
  arrange(brexit, desc(relative_freq)) |> 
  group_by(brexit) |> 
  slice_max(order_by = relative_freq, n = 5, with_ties = FALSE)

print(party_brexit_summary)
## # A tibble: 10 × 5
## # Groups:   brexit [2]
##    brexit party            constituency_count total_constituencies relative_freq
##     <dbl> <fct>                         <int>                <int>         <dbl>
##  1      0 labour                           58                  639       0.0908 
##  2      0 conservative                     57                  639       0.0892 
##  3      0 scottish-nation…                 49                  639       0.0767 
##  4      0 labourco-operat…                 13                  639       0.0203 
##  5      0 liberal-democrat                  6                  639       0.00939
##  6      1 conservative                    188                  639       0.294  
##  7      1 labour                          112                  639       0.175  
##  8      1 labourco-operat…                  8                  639       0.0125 
##  9      1 dup                               5                  639       0.00782
## 10      1 liberal-democrat                  2                  639       0.00313

We must define the dierctory in which we have the images we previously downloaded and stored:

logo_dir <- "~/Desktop/Master/TFM/TFM_final" # Change this to your working directory

# Create a dataframe with the names of the parties and the corresponding paths to the logo files
party_logos <- data.frame(
  party = c("scottish-national-party", "labour", "conservative", "labourco-operative", "liberal-democrat", "dup"),
  logo = file.path(logo_dir, c(
    "scottish-national-party.png",
    "labour.png",
    "conservative.png",
    "labourco-operative.png",
    "liberal-democrat.png",
    "dup.png"
  ))
)

# Create a function to resize the images
resize_image <- function(image_path, width = 120, height = 80) {
  # Read the image from the specified file path
  image <- image_read(image_path)
  # Resize the image to the specified W & H, forcing the exact dimensions
  image <- image_resize(image, paste0(width, "x", height, "!"))
  # Create a temporary file with a .png extension to store the resized image
  temp_file <- tempfile(fileext = ".png")
  # Write the resized image to the temporary file in PNG format
  image_write(image, path = temp_file, format = "png")
  # Return the path to the temporary file containing the resized image
  return(temp_file)
}

party_logos$logo <- sapply(party_logos$logo, resize_image)

# Merge the logos df with party_brexit_summary
party_brexit_summary_with_logos <- party_brexit_summary |>
  left_join(party_logos, by = "party")

Prepare the data for the two different sides (pro-Brexit and anti-Brexit) and set a colour palette for the most common parties in the House of Commons:

# Filter data for Brexit and No Brexit plots
brexit_data <- party_brexit_summary_with_logos |> filter(brexit == 1)
no_brexit_data <- party_brexit_summary_with_logos |> filter(brexit == 0)

party_colors <- c(
  "labour" = "red3",
  "conservative" = "dodgerblue3",
  "scottish-national-party" = "lemonchiffon",
  "labourco-operative" = "darkmagenta",
  "dup" = "darksalmon",
  "liberal-democrat" = "goldenrod1"
)

Plot the results:

# Determine the common y-axis limit
y_max <- max(c(max(brexit_data$relative_freq+ 0.03), max(no_brexit_data$relative_freq)))

# Create Brexit plot with fixed y-axis limits
brexit_plot <- ggplot(brexit_data, aes(x = reorder(party, relative_freq), y = relative_freq, fill = party)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  geom_image(aes(image = logo, y = relative_freq + 0.015), position = position_dodge(width = 0.7), size = 0.1) +
  scale_fill_manual(values = party_colors) +
  theme_minimal() +
  labs(
    title = "",
    x = "Brexit areas",
    y = "Proportion of Electoral representation in Brexit districts"
  ) +
  theme(
    axis.text.x = element_blank(),
    axis.title.x = element_text(size = 12, face = "bold"), 
    legend.position = "none",
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_line(linetype = "dashed"), 
    panel.grid.minor.y = element_line(linetype = "dashed")
  ) +
  ylim(0, y_max)

# Create No Brexit plot with fixed y-axis limits
no_brexit_plot <- ggplot(no_brexit_data, aes(x = reorder(party, relative_freq), y = relative_freq, fill = party)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  geom_image(aes(image = logo, y = relative_freq + 0.015), position = position_dodge(width = 0.7), size = 0.1) +
  scale_fill_manual(values = party_colors) +
  theme_minimal() +
  labs(
    title = "",
    x = "No Brexit areas",
    y = ""
  ) +
  theme(
    axis.text.x = element_blank(),
    axis.title.x = element_text(size = 12, face = "bold"), 
    legend.position = "none",
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_line(linetype = "dashed"), 
    panel.grid.minor.y = element_line(linetype = "dashed")
  ) +
  ylim(0, y_max)

# Manually create the legend
legend_plot <- ggplot(data.frame(party = names(party_colors), count = rep(1, length(party_colors))), 
                      aes(x = party, y = count, fill = party)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = party_colors) +
  theme_void() +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 12, face = "bold"),
    legend.text = element_text(size = 10)
  ) +
  guides(fill = guide_legend(title = "Party"))

# Extract the legend as a separate plot
legend <- ggplotGrob(legend_plot)$grobs[[which(sapply(ggplotGrob(legend_plot)$grobs, function(x) x$name) == "guide-box")]]

# Combine the two plots with a common title
combined_plot <- plot_grid(
  brexit_plot, no_brexit_plot,
  ncol = 2, rel_widths = c(1, 1)
)

# Add the common title and combine with the legend
final_plot <- plot_grid(
  ggdraw() +
    draw_plot(combined_plot, 0.1, 0, 0.9, 0.9) +
    draw_label("Top 5 Parties by Brexit Result in 2016", x = 0.6, y = 0.93, fontface = 'bold', size = 20),
  legend,
  ncol = 2,
  rel_widths = c(0.85, 0.15)
)

# Display the final plot
print(final_plot)

Correlation between the main party governing before Brexit and score on each of the topics

I will create a score for each topic in terms of brexit or No brexit. Then see the impact of having one party governing in a constituency or another in showing a profile closer to the Brexit or No brexit status.

Sentiment needs to be from 2013 to capture the ideas/c concepts of the parties regarding brexit but only the representation in 2016 (party in power in 2016) might have a correltion. This can be maybe because even the electoral win of those MPs in certain areas was forecoming the brexit results as well.

First: Average sentiment per topic per party:

# Calculate average sentiment per topic per party
avg_sentiment <- brexit_df |> 
  group_by(category, party, sentiment) |> 
  summarize(avg_score = mean(score, na.rm = TRUE)) |> 
  ungroup()

# Select the sentiment with the highest score for each topic per party
max_sentiment <- avg_sentiment |> 
  group_by(category, party) |> 
  filter(avg_score == max(avg_score)) |> 
  ungroup() |> 
  distinct(category, party, .keep_all = TRUE)

Filter for 2016 and Calculate Average Leave Percentage for each party based on brexit status

# Filter for the year 2016 and select relevant columns
df_2016 <- brexit_df |> 
  filter(format(as.Date(date, format="%Y-%m-%d"), "%Y") == "2016") |> 
  select(party, brexit, leave_perc)

# Calculate the average leave percentage for each party based on Brexit status
avg_leave_perc <- df_2016 |> 
  group_by(party, brexit) |> 
  summarize(avg_leave_perc = mean(leave_perc, na.rm = TRUE)) |> 
  ungroup()

Merge data frames:

# Merge the two data frames on the party column
final_df <- left_join(max_sentiment, avg_leave_perc, by = "party", relationship = "many-to-many")

# Display the resulting dataframe structure
str(final_df)
## tibble [218 × 6] (S3: tbl_df/tbl/data.frame)
##  $ category      : chr [1:218] "Crime/Law" "Crime/Law" "Crime/Law" "Crime/Law" ...
##  $ party         : Factor w/ 14 levels "alliance","conservative",..: 2 2 4 5 6 6 7 7 8 8 ...
##  $ sentiment     : Factor w/ 3 levels "negative","neutral",..: 3 3 2 1 3 3 2 2 3 3 ...
##  $ avg_score     : num [1:218] 0.668 0.668 0.749 0.638 0.627 ...
##  $ brexit        : num [1:218] 0 1 0 0 0 1 0 1 0 1 ...
##  $ avg_leave_perc: num [1:218] 0.421 0.581 0.259 0.426 0.375 ...

Test the correlation/causation:

Let´s carry out a few ANOVA preliminary tests:

To see if there are any significant differences between between different parties in the average (leave) Brexit result:

# Convert party to a factor
final_df$party <- as.factor(final_df$party)

# Perform ANOVA
anova_party <- aov(avg_leave_perc ~ party, data = final_df)
summary(anova_party)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## party        10  1.135 0.11348   19.96 <2e-16 ***
## Residuals   196  1.114 0.00569                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 11 observations deleted due to missingness

And plot the results in a table:

# Extract the ANOVA summary as a data frame
anova_summary <- summary(anova_party)[[1]]

# Convert the summary table into a data frame with meaningful row names
anova_df <- data.frame(
  Term = c("party", "Residuals"),
  Df = anova_summary[, "Df"],
  `Sum Sq` = round(anova_summary[, "Sum Sq"], 3), # Rounded for better display
  `Mean Sq` = round(anova_summary[, "Mean Sq"], 5),
  `F value` = round(anova_summary[, "F value"], 2),
  `Pr(>F)` = format.pval(anova_summary[, "Pr(>F)"], digits = 3, eps = 0.001)
)

# Add significance codes manually
anova_df$Signif <- c("***", "")

# Create the table using gt
anova_gt_table <- anova_df |>
  gt() |>  # Convert the data frame into a gt table object
  tab_header(
    title = "ANOVA Summary Table"  # Add a title to the table
  ) |>
  cols_label(
    Term = "Term",  # Rename the 'Term' column
    Df = "Degrees of Freedom",  # Rename the 'Df' column
    Sum.Sq = "Sum of Squares",  # Correctly rename the 'Sum.Sq' column
    Mean.Sq = "Mean Square",  # Correctly rename the 'Mean.Sq' column
    F.value = "F Value",  # Correctly rename the 'F.value' column
    Pr..F. = "p-value",  # Correctly rename the 'Pr..F.' column
    Signif = "Signif. Codes"  # Rename the 'Signif' column
  ) |>
  tab_style(
    style = cell_text(weight = "bold"),  # Bold style for the column headers
    locations = cells_column_labels(everything())  
  ) |>
  tab_footnote(
    # Add a footnote explaining significance codes
    footnote = "Significance codes: *** p < 0.001",
    # Place the footnote under the 'Signif. Codes' header
    locations = cells_column_labels("Signif") 
  )

# Display the table
anova_gt_table  # Render the gt table in the output
ANOVA Summary Table
Term Degrees of Freedom Sum of Squares Mean Square F Value p-value Signif. Codes1
party 10 1.135 0.11348 19.96 <0.001 ***
Residuals 196 1.114 0.00569 NA NA
1 Significance codes: *** p < 0.001

To determine if the sentiments expressed in speeches of each category by each party have a significant direct effect on the average leave percentage (i.e., testing for a triple interaction) we can do a multi-way ANOVA

final_df$sentiment <- as.factor(final_df$sentiment)
final_df$category <- as.factor(final_df$category)

# Perform a multi-way ANOVA with triple interaction
anova_multi <- aov(avg_leave_perc ~ party * sentiment * category, data = final_df)
summary(anova_multi)
##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## party           10  1.135 0.11348    8.35 3.33e-09 ***
## sentiment        2  0.000 0.00000    0.00        1    
## category        11  0.000 0.00000    0.00        1    
## party:sentiment 19  0.000 0.00000    0.00        1    
## party:category  82  0.000 0.00000    0.00        1    
## Residuals       82  1.114 0.01359                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 11 observations deleted due to missingness

And plot the table:

# Extract the ANOVA summary as a data frame
anova_multi_summary <- summary(anova_multi)[[1]]

# Convert the summary table into a data frame with meaningful row names
anova_multi_df <- data.frame(
  Term = rownames(anova_multi_summary),
  Df = anova_multi_summary[, "Df"],
  `Sum Sq` = round(anova_multi_summary[, "Sum Sq"], 3),  # Rounded for better display
  `Mean Sq` = round(anova_multi_summary[, "Mean Sq"], 5),
  `F value` = round(anova_multi_summary[, "F value"], 2),
  `Pr(>F)` = format.pval(anova_multi_summary[, "Pr(>F)"], digits = 3, eps = 0.001)
)

# Add significance codes manually based on the results
anova_multi_df$Signif <- c("***", "", "", "", "", "")

# Create the table using gt
anova_multi_gt_table <- anova_multi_df |>
  gt() |>  # Convert the data frame into a gt table object
  tab_header(
    title = "Multi-Way ANOVA Summary Table"  # Add a title to the table
  ) |>
  cols_label(
    Term = "Term",  # Rename the 'Term' column
    Df = "Degrees of Freedom",  # Rename the 'Df' column
    Sum.Sq = "Sum of Squares",  # Correctly rename the 'Sum.Sq' column
    Mean.Sq = "Mean Square",  # Correctly rename the 'Mean.Sq' column
    F.value = "F Value",  # Correctly rename the 'F.value' column
    Pr..F. = "p-value",  # Correctly rename the 'Pr..F.' column
    Signif = "Signif. Codes"  # Rename the 'Signif' column
  ) |>
  tab_style(
    style = cell_text(weight = "bold"),  # Bold style for the column headers
    locations = cells_column_labels(everything())  # Apply to all column headers
  ) |>
  tab_footnote(
    # Add a footnote explaining significance codes
    footnote = "Significance codes: *** p < 0.001",  
    # Place the footnote under the 'Signif. Codes' header
    locations = cells_column_labels("Signif")  
  )

# Display the table
anova_multi_gt_table  # Render the gt table in the output
Multi-Way ANOVA Summary Table
Term Degrees of Freedom Sum of Squares Mean Square F Value p-value Signif. Codes1
party 10 1.135 0.11348 8.35 <0.001 ***
sentiment 2 0.000 0.00000 0.00 1
category 11 0.000 0.00000 0.00 1
party:sentiment 19 0.000 0.00000 0.00 1
party:category 82 0.000 0.00000 0.00 1
Residuals 82 1.114 0.01359 NA NA
1 Significance codes: *** p < 0.001

The results indicate that while the party of the speaker significantly influences the average leave percentage, the combined effects of party, sentiment, and category do not have a significant impact. This suggests that the influence of sentiments expressed in speeches of each category by each party does not significantly vary the average leave percentage.

Also I would like to look at this last relationship with further detail before jumping into a regression analysis. Correlation Analysis for Each Type of Sentiment:

# Create a function to calculate correlation for each sentiment
calculate_correlation <- function(df, sentiment_type) {
  subset_df <- df |> filter(sentiment == sentiment_type)
  cor(subset_df$avg_leave_perc, subset_df$avg_score, use = "complete.obs")
}

# Calculate correlation for each sentiment type
sentiment_types <- levels(final_df$sentiment)
correlations <- sapply(sentiment_types, function(sentiment) {
  calculate_correlation(final_df, sentiment)
})

# Print correlations for each sentiment type
names(correlations) <- sentiment_types
print(correlations)
##    negative     neutral    positive 
## -0.28883771 -0.01732026  0.02238975

And plot this:

# Convert the correlations to a data frame
correlations_df <- data.frame(
  sentiment = names(correlations),
  correlation = correlations
)

# Create a dot plot
ggplot(correlations_df, aes(x = sentiment, y = correlation, color = sentiment)) + 
  geom_point(size = 5) +  # Create dots with a specified size
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +  # Add a dashed line at y = 0 for reference
  theme_minimal(base_size = 15) +  # Minimal theme with increased base size
  labs(
    title = "Correlation Between Avg Leave % and Avg Score by Sentiment Type",
    x = "Sentiment Type",
    y = "Correlation"
  ) +
  geom_text(aes(label = round(correlation, 3)), vjust = -1, size = 5) +  # Add correlation values next to dots
  scale_color_manual(values = c("negative" = "red3", "neutral" = "gray", "positive" = "springgreen")) + 
  ylim(-0.35, 0.1) +  # Adjust y-axis limits to add space at the top
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5, size = 16),  # Bold, centered title
    axis.title.x = element_text(face = "bold", size = 14),  # Bold x-axis label
    axis.title.y = element_text(face = "bold", size = 14),  # Bold y-axis label
    axis.text = element_text(size = 12),  # Size for axis tick labels
    legend.position = "none",  # Remove legend
    plot.margin = margin(t = 20, r = 20, b = 30, l = 20),  # Adjust plot margins to add space around the plot
    axis.text.x = element_text(margin = margin(t = 10))  # Add space below the x-axis text labels
  )

Linear regression model:

# Remove rows with NA values before running regression
final_df_complete <- final_df |> 
  drop_na()

# Relevel the party factor to set a different reference category *** CAREFUL = NAs!!!
final_df$party <- relevel(final_df$party, ref = "ukip") #party that supported brexit the most

# Fit a linear regression model
model <- lm(avg_leave_perc ~ party + category*sentiment , data = final_df)
summary(model)
## 
## Call:
## lm(formula = avg_leave_perc ~ party + category * sentiment, data = final_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.11106 -0.07906  0.00000  0.07906  0.11106 
## 
## Coefficients: (1 not defined because of singularities)
##                                                             Estimate Std. Error
## (Intercept)                                                5.008e-01  8.942e-02
## partydup                                                   2.249e-02  2.633e-02
## partygreen                                                -2.415e-01  3.353e-02
## partyindependent                                          -7.480e-02  3.342e-02
## partylabour                                               -1.479e-02  2.664e-02
## partylabourco-operative                                    7.596e-03  2.649e-02
## partyliberal-democrat                                     -1.501e-02  2.672e-02
## partyplaid-cymru                                          -3.772e-02  2.818e-02
## partyscottish-national-party                              -4.956e-02  2.591e-02
## partysocial-democratic-and-labour-party                   -2.324e-01  3.389e-02
## partyuup                                                  -8.742e-02  3.901e-02
## categoryDefense/Foreign Affairs                            2.468e-15  1.236e-01
## categoryEconomy/Technology                                 1.318e-15  1.010e-01
## categoryEducation                                          1.326e-15  1.009e-01
## categoryEmployment                                         1.367e-15  1.006e-01
## categoryEnergy/Environment                                 1.402e-15  9.122e-02
## categoryEU/Brexit                                          1.732e-15  1.235e-01
## categoryHealth                                             1.464e-15  9.585e-02
## categoryHousing                                            1.382e-15  9.444e-02
## categoryParliamentary proceedings                         -3.882e-16  4.267e-02
## categorySocial Security/Society/Welfare                    1.475e-15  9.347e-02
## categoryTransport                                          1.389e-15  9.559e-02
## sentimentneutral                                           1.956e-15  9.774e-02
## sentimentpositive                                          1.805e-15  9.263e-02
## categoryDefense/Foreign Affairs:sentimentneutral          -3.099e-15  1.358e-01
## categoryEconomy/Technology:sentimentneutral               -1.739e-15  1.182e-01
## categoryEducation:sentimentneutral                        -1.866e-15  1.164e-01
## categoryEmployment:sentimentneutral                       -2.119e-15  1.459e-01
## categoryEnergy/Environment:sentimentneutral               -1.908e-15  1.096e-01
## categoryEU/Brexit:sentimentneutral                        -2.398e-15  1.601e-01
## categoryHealth:sentimentneutral                           -2.047e-15  1.126e-01
## categoryHousing:sentimentneutral                          -1.830e-15  1.217e-01
## categoryParliamentary proceedings:sentimentneutral        -1.419e-16  6.778e-02
## categorySocial Security/Society/Welfare:sentimentneutral  -1.978e-15  1.161e-01
## categoryTransport:sentimentneutral                        -1.866e-15  1.225e-01
## categoryDefense/Foreign Affairs:sentimentpositive         -2.937e-15  1.317e-01
## categoryEconomy/Technology:sentimentpositive              -1.743e-15  1.092e-01
## categoryEducation:sentimentpositive                       -1.698e-15  1.101e-01
## categoryEmployment:sentimentpositive                      -1.740e-15  1.080e-01
## categoryEnergy/Environment:sentimentpositive              -1.822e-15  1.052e-01
## categoryEU/Brexit:sentimentpositive                       -2.143e-15  1.291e-01
## categoryHealth:sentimentpositive                          -1.827e-15  1.073e-01
## categoryHousing:sentimentpositive                         -1.775e-15  1.046e-01
## categoryParliamentary proceedings:sentimentpositive               NA         NA
## categorySocial Security/Society/Welfare:sentimentpositive -1.887e-15  1.039e-01
## categoryTransport:sentimentpositive                       -1.768e-15  1.043e-01
##                                                           t value Pr(>|t|)    
## (Intercept)                                                 5.600 8.97e-08 ***
## partydup                                                    0.854   0.3943    
## partygreen                                                 -7.202 2.11e-11 ***
## partyindependent                                           -2.238   0.0266 *  
## partylabour                                                -0.555   0.5795    
## partylabourco-operative                                     0.287   0.7747    
## partyliberal-democrat                                      -0.562   0.5751    
## partyplaid-cymru                                           -1.338   0.1827    
## partyscottish-national-party                               -1.913   0.0575 .  
## partysocial-democratic-and-labour-party                    -6.858 1.40e-10 ***
## partyuup                                                   -2.241   0.0264 *  
## categoryDefense/Foreign Affairs                             0.000   1.0000    
## categoryEconomy/Technology                                  0.000   1.0000    
## categoryEducation                                           0.000   1.0000    
## categoryEmployment                                          0.000   1.0000    
## categoryEnergy/Environment                                  0.000   1.0000    
## categoryEU/Brexit                                           0.000   1.0000    
## categoryHealth                                              0.000   1.0000    
## categoryHousing                                             0.000   1.0000    
## categoryParliamentary proceedings                           0.000   1.0000    
## categorySocial Security/Society/Welfare                     0.000   1.0000    
## categoryTransport                                           0.000   1.0000    
## sentimentneutral                                            0.000   1.0000    
## sentimentpositive                                           0.000   1.0000    
## categoryDefense/Foreign Affairs:sentimentneutral            0.000   1.0000    
## categoryEconomy/Technology:sentimentneutral                 0.000   1.0000    
## categoryEducation:sentimentneutral                          0.000   1.0000    
## categoryEmployment:sentimentneutral                         0.000   1.0000    
## categoryEnergy/Environment:sentimentneutral                 0.000   1.0000    
## categoryEU/Brexit:sentimentneutral                          0.000   1.0000    
## categoryHealth:sentimentneutral                             0.000   1.0000    
## categoryHousing:sentimentneutral                            0.000   1.0000    
## categoryParliamentary proceedings:sentimentneutral          0.000   1.0000    
## categorySocial Security/Society/Welfare:sentimentneutral    0.000   1.0000    
## categoryTransport:sentimentneutral                          0.000   1.0000    
## categoryDefense/Foreign Affairs:sentimentpositive           0.000   1.0000    
## categoryEconomy/Technology:sentimentpositive                0.000   1.0000    
## categoryEducation:sentimentpositive                         0.000   1.0000    
## categoryEmployment:sentimentpositive                        0.000   1.0000    
## categoryEnergy/Environment:sentimentpositive                0.000   1.0000    
## categoryEU/Brexit:sentimentpositive                         0.000   1.0000    
## categoryHealth:sentimentpositive                            0.000   1.0000    
## categoryHousing:sentimentpositive                           0.000   1.0000    
## categoryParliamentary proceedings:sentimentpositive            NA       NA    
## categorySocial Security/Society/Welfare:sentimentpositive   0.000   1.0000    
## categoryTransport:sentimentpositive                         0.000   1.0000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08294 on 162 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.5045, Adjusted R-squared:   0.37 
## F-statistic: 3.749 on 44 and 162 DF,  p-value: 4.913e-10

Many of the interaction terms between category and sentiment are not significant, with p-values of 1.0000, indicating no effect on the outcome variable.

Yet we might want to see some visualizations of the analysis:

library(ggplot2)
library(dplyr)

# Filter significant coefficients (p < 0.05)
significant_coefs <- broom::tidy(model) |>
  filter(p.value < 0.05)

# Interaction plot between category and sentiment
interaction_plot <- interaction.plot(
  final_df$sentiment, final_df$category, final_df$avg_leave_perc,
  col = c("red", "green", "blue"), legend = TRUE,
  ylab = "Avg Leave %", xlab = "Sentiment",
  trace.label = "Category"
)

# Coefficients plot: 
# Tidy the model output
tidy_model <- tidy(model, conf.int = TRUE)

# Filter out the intercept for clarity and remove rows with missing estimates
tidy_model <- tidy_model |>
  filter(term != "(Intercept)") |>
  filter(!is.na(estimate))

# Highlight significant variables
tidy_model <- tidy_model |>
  mutate(term_label = ifelse(p.value < 0.05, paste0("<b>", term, "</b>"), term))

# Create the plot
highlighted_plot <- ggplot(tidy_model, aes(x = reorder(term_label, estimate), y = estimate)) +
  geom_point(aes(color = p.value < 0.05), size = 3) +  # Color points by significance
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  coord_flip() +  # Flip the coordinates for better readability
  labs(
    title = "Coefficient Plot of Linear Regression Model",
    x = "Predictors",
    y = "Coefficient Estimate",
    color = "Significant (p < 0.05)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16),
    axis.title = element_text(size = 14),
    axis.text.y = element_markdown(size = 12),  # Use element_markdown to support HTML tags
    axis.text.x = element_text(size = 12),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12)
  )

print(highlighted_plot)

# Residuals vs Fitted Plot
plot(model, which = 1)  # This will give a residuals vs fitted plot

# Q-Q Plot
plot(model, which = 2)  # This will give a Q-Q plot

# Cook's Distance Plot
plot(model, which = 4)  # This will give a Cook's distance plot

The Green Party, Independent Party, Social Democratic and Labour Party, and UUP Party all have significant negative influences. Check party influence alone:

model2 <- lm(avg_leave_perc ~ party, data = final_df)
summary(model2)
## 
## Call:
## lm(formula = avg_leave_perc ~ party, data = final_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.11106 -0.07906  0.00000  0.07906  0.11106 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              0.500808   0.015392  32.537  < 2e-16
## partydup                                 0.022488   0.022257   1.010  0.31356
## partygreen                              -0.241472   0.026660  -9.058  < 2e-16
## partyindependent                        -0.074799   0.026660  -2.806  0.00553
## partylabour                             -0.014794   0.021767  -0.680  0.49754
## partylabourco-operative                  0.007596   0.021767   0.349  0.72751
## partyliberal-democrat                   -0.015008   0.021767  -0.689  0.49135
## partyplaid-cymru                        -0.037720   0.022257  -1.695  0.09171
## partyscottish-national-party            -0.049562   0.021767  -2.277  0.02387
## partysocial-democratic-and-labour-party -0.232427   0.027456  -8.466 5.92e-15
## partyuup                                -0.087417   0.030784  -2.840  0.00499
##                                            
## (Intercept)                             ***
## partydup                                   
## partygreen                              ***
## partyindependent                        ** 
## partylabour                                
## partylabourco-operative                    
## partyliberal-democrat                      
## partyplaid-cymru                        .  
## partyscottish-national-party            *  
## partysocial-democratic-and-labour-party ***
## partyuup                                ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0754 on 196 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.5045, Adjusted R-squared:  0.4793 
## F-statistic: 19.96 on 10 and 196 DF,  p-value: < 2.2e-16

11 missing values because the parties “alliance”, “ukip”, and “respect” did not participate in all debates regarding all the topics and thus some NAs are generated

Sentiment evolution through time

brexit_eu_brexit <- brexit_df |> 
  mutate(date = as.Date(date)) |> 
  filter(category == "EU/Brexit") |> 
  group_by(date, sentiment) |> 
  summarize(avg_score = mean(score, na.rm = TRUE), .groups = 'drop')
library(ggplot2)
library(plotly)

# Define custom colors
line_colors <- c("negative" = "coral", "positive" = "darkolivegreen2", "neutral" = "darkslategray2")
smooth_colors <- c("red3", "darkgreen", "deepskyblue3")

# Ensure the date column is in Date format
brexit_eu_brexit$date <- as.Date(brexit_eu_brexit$date)

# Create the plot with geom_line
plot <- ggplot(brexit_eu_brexit, aes(x = date, y = avg_score, group = sentiment)) +
  geom_line(aes(color = sentiment), size = 0.5) +
  scale_color_manual(values = line_colors, name = "Sentiment") +
  labs(title = "Evolution of Average Sentiment Score for EU-Brexit",
       x = "Date",
       y = "Average Sentiment Score") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  ) +
  scale_x_date(date_labels = "%Y", date_breaks = "1 year")

# Add the geom_smooth layers separately with manual colors
for (i in seq_along(smooth_colors)) {
  sentiment_type <- unique(brexit_eu_brexit$sentiment)[i]
  plot <- plot + 
    geom_smooth(data = subset(brexit_eu_brexit, sentiment == sentiment_type), 
                aes(x = date, y = avg_score), 
                method = "loess", 
                se = FALSE, 
                linetype = "dashed", 
                size = 1, 
                color = smooth_colors[i])
}

# Add a vertical line for 23 June 2016
plot <- plot + 
  geom_vline(xintercept = as.Date("2016-06-23"), color = "red", size = 1, linetype = "solid")

plot

We can also have in html an interactive version of the plot

# Make the plot interactive
interactive_plot <- ggplotly(plot)
interactive_plot

Topic modelling for EU/BREXIT topics

IN CASE ANY PACKAGE DOES NOT LOAD CORRECTLY

It might happen, especially if this code is being ran in a Mac computer that some packages such as BERTopic or UAMP do not work correctly as they should or do not load (specifically this might happen in the Topic Modelling section). If this is your case the you should set your own virtual environment through shell before running the following chunks. In order to do, you can follow this steps:

  1. Restart your R session
  2. Run the following code in your shell/terminal:
conda remove -n r-reticulate --all
conda create -n r-reticulate python=3.9
conda activate r-reticulate
pip install bertopic sentence-transformers transformers torch
  1. Then run the following R chunk:
# Set the Python environment to the newly created r-reticulate environment
Sys.setenv(RETICULATE_PYTHON = "/opt/anaconda3/envs/r-reticulate/bin/python")

library(reticulate)

# Check if transformers and BERTopic can be imported without errors
py_run_string("from bertopic import BERTopic")

With this the packages issue should be fixed. As a recommendation right after this I would try to install and call all the python libraries of this notebook.

BERTopic: Running the HF model

Using BERTopic: uses various embedding techniques and c-TF-IDF to create dense clusters allowing for easily interpretable topics.

| eval = False
!pip install --upgrade pip
!pip install transformers datasets
!pip install torch
!pip install matplotlib
!pip install umap-learn
!pip install hdbscan
!pip install nltk
!pip install llvmlite 
!pip install numba
!pip install bertopic
| eval = False
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModel
import torch
from bertopic import BERTopic
from umap import UMAP
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import os
import warnings
import time

os.environ["TOKENIZERS_PARALLELISM"] = "false"
warnings.filterwarnings("ignore", message=".*beta.*")
warnings.filterwarnings("ignore", message=".*gamma.*")

Step 1: Prepare and preprocess the Text (Including Custom Stopwords and Lemmatization)

HF model used: https://medium.com/@okanyenigun/exploring-hugging-face-topic-modeling-7590efe7c3d3

First I would need to load and filter my dataset:

| eval = False
df = pd.read_csv("/Users/lauramartinez/Desktop/Master/TFM/TFM_final/sentiment_python_df.csv")
df.info()
| eval = False
# Fill NaN values in the 'category' column with an empty string
df['category'] = df['category'].fillna('Parliamentary proceedings')

# Now apply the .str.contains() method
brexit_df = df[df['category'].str.contains('EU/Brexit')]
brexit_df.info()

Now preprocess the text:

  • Use a word lemmatizer

  • Filter out regular english stopwords as well as custom stopwords

  • Standarise text converting it into lowercase, removing punctuation etc…

  • Apply the preprocessing to the corpus

| eval = False

# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Define a function to preprocess the text
def preprocess_text(text):
    # List of words to remove
    custom_stopwords = set(stopwords.words('english')) | {"hon", "people", "friend", "gentleman", "minister", "member", "house", "right", "does", "say", "want", "prime", "mr", "lord", "yes", "amendment", "uk", "eu", "parliament", "government", "vote", "country", "united", "kingdom", "need", "way", "agreement", "agree", "point", "make", "know", "debate", "think", "time", "business", "european", "referendum", "member", "year", "said", "voted", "claimed", "rid", "issue", "making", "going", "motion", "intervention", "thing", "today", "learned", "come", "place", "british", "level", "union", "party", "called", "talking", "like", "lady", "order", "day", "mp", "speaker", "itll", "able", "unionyes", "actually", "member", "gain", "conversation", "giving", "given" }
    # Lowercase the text
    text = text.lower()
    # Remove punctuation and numbers
    text = re.sub(r'[^\w\s\']', '', text)  # Remove punctuation but retain apostrophes
    text = re.sub(r'\b\d+\b', '', text)  # Remove numbers
    # Lemmatize and remove stopwords
    cleaned_text = ' '.join([lemmatizer.lemmatize(word) for word in text.split() if word not in custom_stopwords])
    return cleaned_text
  
# Apply the preprocessing function to the 'speech' column
brexit_df['speech'] = brexit_df['speech'].apply(preprocess_text)

Step 2: Use BERTopic for Topic Modeling

To improve the topic assignment and potentially split the large cluster into more meaningful topics, you can adjust the parameters of BERTopic. Specifically, you can change the min_topic_size parameter to control the minimum size of the topics and the nr_topics parameter to set the number of topics you desire.

| eval = False

# Step 1: Initialize tokenizer and model from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to('cuda' if torch.cuda.is_available() else 'cpu')

# Step 2: Function to handle long texts by splitting into chunks and averaging their embeddings
def embed_long_text(text, tokenizer, model, chunk_size=512):
  # Preprocess the text before tokenizing
    text = preprocess_text(text)
  # Tokenize the preprocessed text into a list of tokens (subwords)
    tokens = tokenizer.tokenize(text)
    if len(tokens) == 0:
        return np.zeros((768,))  # Return a zero vector if there are no tokens
     # Calculate the number of chunks needed to process the tokens.
    num_chunks = (len(tokens) + chunk_size - 1) // chunk_size
    all_embeddings = []
    # Loop over each chunk
    for i in range(num_chunks):
      # Extract the tokens for the current chunk based on the chunk index
        chunk_tokens = tokens[i * chunk_size:(i + 1) * chunk_size]
      # Convert the chunk of tokens back to a string format for BERT input
        chunk_text = tokenizer.convert_tokens_to_string(chunk_tokens)
# Tokenize and encode the chunked text, returning tensors for PyTorch and ensuring the maximum length is `chunk_size`.
        encoded_input = tokenizer(chunk_text, return_tensors='pt', truncation=True, max_length=chunk_size).to('cuda' if torch.cuda.is_available() else 'cpu')
# Disable gradient computation as we're only performing inference, not training
        with torch.no_grad():
          # Pass the encoded input through the model to obtain the output
            outputs = model(**encoded_input)
        # Compute the mean of the last hidden state embeddings
        chunk_embedding = outputs.last_hidden_state.mean(dim=1).cpu().numpy()
        if chunk_embedding.shape == (1, 768):  # Model output shape should be (1, 768)
            all_embeddings.append(chunk_embedding[0])
        else:
            print(f"Unexpected shape {chunk_embedding.shape} for chunk {i} in text: {text[:30]}")

    # If there are valid chunk embeddings, return the mean of all chunk embeddings as the final embedding for the text
    if all_embeddings:
        return np.mean(all_embeddings, axis=0)
    else:
        return np.zeros((768,))  # Return a zero vector if no valid embeddings

# Embedding speeches
embeddings = []
preprocessed_speeches = brexit_df['speech'].tolist()
start_time = time.time()

for idx, speech in enumerate(preprocessed_speeches):
  # Convert the 'speech' column of the dataframe `brexit_df` to a list of preprocessed speeches
    embedding = embed_long_text(speech, tokenizer, model)
    if embedding.shape == (768,):  # Ensure that the embedding has the correct shape
        embeddings.append(embedding)
    else:
        print(f"Skipping speech {idx} due to incorrect embedding shape.")
    # Log progress every 100 rows
    if (idx + 1) % 100 == 0:
        elapsed_time = time.time() - start_time
        print(f"Processed {idx + 1}/{len(brexit_df)} rows. Elapsed time: {elapsed_time:.2f} seconds.")
#If a warning on tokens length shows up stop the running and run this last chunk again

# Convert list of embeddings to a numpy array
embeddings = np.array(embeddings)
assert embeddings.shape[1] == 768, f"Embeddings shape mismatch: {embeddings.shape[1]} != 768"
| eval = False
# Adjust UMAP parameters for potentially better topic separation
umap_model = UMAP(n_neighbors=10, n_components=2, min_dist=0.25, metric='cosine', random_state=42)

# Adjust BERTopic parameters
topic_model = BERTopic(umap_model=umap_model, verbose=True, min_topic_size=8)  # Decrease min_topic_size to allow more topics

# Fit the model
topics, probs = topic_model.fit_transform(preprocessed_speeches[:len(embeddings)], embeddings)

# Verify the lengths
print(f"Number of speeches processed: {len(preprocessed_speeches)}")
print(f"Number of embeddings: {len(embeddings)}")
print(f"Number of topics: {len(topics)}")

# Add topics to the DataFrame
brexit_df.loc[brexit_df.index[:len(topics)], 'topic'] = topics

# Save the updated DataFrame to a new CSV file
brexit_df.to_csv('/Users/lauramartinez/Desktop/Master/TFM/TFM_final/clustered_speech.csv', index=False)

Visualization of the embedded topics:

| eval = False

# Reduce dimensions using UMAP for visualization
reduced_embeddings = umap_model.fit_transform(embeddings)

# Visualize the reduced embeddings with topic labels
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
unique_topics = np.unique(topics)
for topic in unique_topics:
    idx = np.where(topics == topic)
    plt.scatter(reduced_embeddings[idx, 0], reduced_embeddings[idx, 1], label=f"Topic {topic}")
    

plt.legend()
plt.title("Topic Visualization with UMAP")
plt.xlabel("UMAP Dimension 1")
plt.ylabel("UMAP Dimension 2")

# Save the plot to an image file
plt.savefig("topic_visualization_umap.png", dpi=300, bbox_inches='tight')  # Save as PNG with high resolution

plt.show()

New Topic Assignments by BERTopic:

  • BERTopic identifies topics based on the embeddings and their distribution in the high-dimensional space. The topics identified by BERTopic do not necessarily match the topics in your original DataFrame, especially if there is a semantic or structural difference in how topics are represented in the embeddings.
  • Also it assigns outliers to topic -1

Step 3: visualize topics

| eval = False
def get_top_n_words(n, corpus, vectorizer):
  # Check if the corpus is empty
    if not corpus:
        return []
  # Fit the vectorizer to the corpus.
    vec = vectorizer.fit(corpus)
  # Transform the corpus into a bag-of-words representation. 
    bag_of_words = vec.transform(corpus)
  # Sum the occurrences of each word across all documents in the corpus.
    sum_words = bag_of_words.sum(axis=0)
  # The words and their indices are retrieved from the vectorizer's vocabulary.
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
  # Sort the list of tuples by word frequency in descending order.
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

# Initialize the vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Extract the top n words for each topic
top_n_words_per_topic = []

for topic in sorted(brexit_df['topic'].unique()):  # Get the number of unique topics
    topic_speeches = brexit_df[brexit_df['topic'] == topic]['speech'].tolist()
    if not topic_speeches:  # Check if the topic_speeches is empty
        continue
    top_words = get_top_n_words(20, topic_speeches, vectorizer)
    if top_words:  # Check if top_words is not empty
        top_n_words_per_topic.append((topic, top_words))

# Debug: Print the topics and number of top words found
for topic, words in top_n_words_per_topic:
    print(f"Topic {topic}: {len(words)} top words")

# Plotting the top words for each topic
num_topics = len(top_n_words_per_topic)
fig, axes = plt.subplots(2, 3, figsize=(15, 10), sharex=False)
axes = axes.flatten()

for i, (topic, top_words) in enumerate(top_n_words_per_topic):
    words = [word for word, freq in top_words]
    freqs = [freq for word, freq in top_words]
    axes[i].barh(words, freqs, color='skyblue')
    axes[i].set_title(f'Topic {topic}')
    axes[i].invert_yaxis()

# Turn off any remaining empty axes
for ax in axes[num_topics:]:
    ax.axis('off')

plt.tight_layout()

# Save the plot with a different name
plt.savefig("top_words_per_topic.png", dpi=300, bbox_inches='tight')  # Save as PNG with high resolution

plt.show()

Topic -1 and Topic 0 have fewer significant words compared to Topic 2 and Topic 3. This might be an indication of less coherence or fewer documents in those topics.

K-Means method:

| eval = False
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
import warnings
from scipy.sparse import SparseEfficiencyWarning


# Reset the df we will work with
brexit_df = df[df['category'].str.contains('EU/Brexit')]
brexit_df.info()

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Define a function to preprocess the text
def preprocess_text(text):
    # List of words to remove
    custom_stopwords = set(stopwords.words('english')) | {"hon", "people", "friend", "gentleman", "minister", "member", "house", "right", "does", "say", "want", "prime", "mr", "lord", "yes", "amendment", "uk", "eu", "parliament", "government", "vote", "country", "united", "kingdom", "need", "way", "agreement", "agree", "point", "make", "know", "debate", "think", "time", "business", "european", "referendum", "member", "year", "said", "voted", "claimed", "rid", "issue", "making", "going", "motion", "intervention", "thing", "today", "learned", "come", "place", "british", "level", "union", "party", "called", "talking", "like", "lady", "order", "day", "mp", "speaker", "itll", "able", "unionyes", "actually", "member", "gain", "conversation", "giving", "given" }
    # Lowercase the text
    text = text.lower()
    # Remove punctuation and numbers
    text = re.sub(r'[^\w\s\']', '', text)  # Remove punctuation but retain apostrophes
    text = re.sub(r'\b\d+\b', '', text)  # Remove numbers
    # Lemmatize and remove stopwords
    cleaned_text = ' '.join([lemmatizer.lemmatize(word) for word in text.split() if word not in custom_stopwords])
    return cleaned_text
  
# Apply the preprocessing function to the 'speech' column
brexit_df['speech'] = brexit_df['speech'].apply(preprocess_text)

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
warnings.simplefilter("ignore", SparseEfficiencyWarning)

# Vectorize the preprocessed speech column
tfidf_matrix = vectorizer.fit_transform(brexit_df['speech'])

from sklearn.cluster import KMeans

# Perform clustering
kmeans = KMeans(n_clusters=6, random_state=0)
labels = kmeans.fit_predict(tfidf_matrix)

# Ensure the length of labels matches the original DataFrame
assert len(labels) == len(brexit_df), "Length of labels does not match the length of the DataFrame."

# Add topic labels to the DataFrame
brexit_df['topic'] = labels

# Save the updated DataFrame to a new CSV file
brexit_df.to_csv('/Users/lauramartinez/Desktop/Master/TFM/TFM_final/clustered_speech2.csv', index=False)

# Optional: Display the first few rows to verify
print(brexit_df.head())


import numpy as np
from collections import Counter
import matplotlib.pyplot as plt

# Function to get the top words for each topic
def get_top_n_words(n, corpus, vectorizer):
    vec = vectorizer.fit(corpus)
    # transform the corpus into a bag-of-words model.
    bag_of_words = vec.transform(corpus)
    # Sum the TF-IDF scores of each word across all documents in the corpus.
    sum_words = bag_of_words.sum(axis=0)
    # Create a list of tuples where each tuple contains a word and its total TF-IDF score in the corpus.
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    # Sort the list of words by their total TF-IDF score in descending order.
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

# Initialize a TF-IDF vectorizer with English stop words removed.
vectorizer = TfidfVectorizer(stop_words='english')
top_n_words_per_topic = []

# Loop over each topic (here assuming there are 6 topics or clusters).
for topic in range(6):  # Changed to 6 clusters
    topic_speeches = brexit_df[brexit_df['topic'] == topic]['speech']
    # Get the top 20 words 
    top_words = get_top_n_words(20, topic_speeches, vectorizer)
    # Append the top words for the current topic to the list.
    top_n_words_per_topic.append(top_words)
| eval = False
# Plotting
fig, axes = plt.subplots(2, 3, figsize=(15, 10), sharex=True)  # Changed to 2 rows and 3 columns
axes = axes.flatten()

for i, ax in enumerate(axes):
    top_words = top_n_words_per_topic[i]
    words = [word for word, freq in top_words]
    freqs = [freq for word, freq in top_words]
    ax.barh(words, freqs)
    ax.set_title(f'Topic {i}')
    ax.invert_yaxis()

plt.tight_layout()

plt.savefig("top_words_per_topic_kmeans.png", dpi=300, bbox_inches='tight')  # Save as PNG with high resolution

plt.show()

Now we compare both methods.

Types of Brexit - Most correlated words

Hard Brexit, soft brexit, no-deal, remain

#Restore your dataset for a new analysis
brexit_topics <- read.csv('python_df.csv')
brexit_topics <- brexit_topics |> 
  select(date, party, category, speech) |> 
  filter(category == "EU/Brexit")

str(brexit_topics)
## 'data.frame':    3780 obs. of  4 variables:
##  $ date    : chr  "2013-01-21" "2013-01-21" "2013-01-21" "2013-01-21" ...
##  $ party   : chr  "conservative" "conservative" "conservative" "conservative" ...
##  $ category: chr  "EU/Brexit" "EU/Brexit" "EU/Brexit" "EU/Brexit" ...
##  $ speech  : chr  "With this it will be convenient to discuss the following: Amendment 7, page 1, line 4, leave out ‘1%’ and inser"| __truncated__ "Do the Opposition want to make it more worth while to be in work than out of work, and if so, how would they do"| __truncated__ "Is it therefore the right hon Gentleman’s and the Opposition’s policy that uprating should be not by 1%, but by"| __truncated__ "The right hon Gentleman has twice from the Dispatch Box repeated the commitment to uprate benefits by inflation"| __truncated__ ...

After the 2016 UK referendum, where the majority voted to leave the European Union, several potential Brexit scenarios were discussed:

  1. Hard Brexit: This scenario involved the UK leaving the EU without any agreements in place, thus exiting both the single market and the customs union. This would mean relying on World Trade Organization (WTO) rules for trade, leading to significant regulatory divergence and potential economic disruptions​ (World Economic Forum)​​ (SPGlobal)​.

  2. Soft Brexit: In a soft Brexit, the UK would leave the EU but retain close ties, such as remaining in the single market (similar to Norway) or the customs union (similar to Turkey). This would minimize trade barriers and maintain some regulatory alignment with the EU, reducing economic impact​ (LSE Blogs)​​ (World Economic Forum)​.

  3. Remain: Although the referendum decided in favor of leaving, there were calls for a second referendum or parliamentary actions to reverse the decision, ultimately aiming to keep the UK within the EU​ (LSE Blogs)​.

  4. Customs Union Only: The UK could have left the EU but stayed in the customs union, allowing tariff-free trade for goods but not services. This would help avoid a hard border in Ireland but limit the UK’s ability to forge independent trade agreements​ (LSE Blogs)​​ (SPGlobal)​.

  5. Bespoke Free Trade Agreement: This would involve negotiating a unique, comprehensive free trade agreement tailored to the UK’s needs. Such an agreement would likely be more complex and require extensive negotiations​ (SPGlobal)​.

  6. No Deal Transition: Leaving the EU without a withdrawal agreement but implementing a transitional arrangement to mitigate immediate disruptions and provide more time for future relationship negotiations​ (SPGlobal)​.

  7. Second Referendum: A call for another public vote on the final Brexit deal, which could result in either accepting the negotiated deal, opting for a different Brexit scenario, or remaining in the EU​ (LSE Blogs)​.

These scenarios reflected the broad spectrum of possibilities considered by policymakers, analysts, and the public as the UK navigated its exit from the EU. Ultimately, the UK pursued a form of Brexit that involved leaving the single market and customs union but with a negotiated trade deal in place.

Preprocess Text Data: create a single Corpus

# Load required packages
library(dplyr)
library(tidyr)
library(tidytext)
library(widyr)

# Create your custom stopwords list
custom_stopwords <- c(stopwords("en"), "hon", "people", "friend", "gentleman", 
                      "minister", "member", "house", "right", "does", "say", 
                      "want", "prime", "mr",  "yes", "amendment", "just",
                      "absolutely", "kingdom", "need", "way",  "agree", "now",
                      "point", "make", "know", "debate", "think", "members",
                      "time", "business", "european", "member", "among",
                      "year", "said", "many","day", "perhaps", "day",
                      "voted", "claimed", "rid",  "making", "going", 
                      "motion","thing", "today", "learned","come", "place", 
                      "british", "level", "us", "one", "years", "uk",
                      "called", "talking", "like", "lady", "order", "mp", "will")

# Remove "no" from the custom stopwords list
custom_stopwords <- setdiff(custom_stopwords, "no")

# Verify that "no" has been removed
"no" %in% custom_stopwords  # Should return FALSE
## [1] FALSE

Now make a custom set of stopwords for each type of N-Gram

as_tibble(..., .name_repair = "minimal"):

  • This ensures that the tibble is created with the default column name value
# Custom unigram stopwords
stopwords_unigram <- as_tibble(c(custom_stopwords, "eu", "union", "us", "one", "no", "can", "referendum", "many", "country", "leave", "brexit", "let", "much", "also", "even", "let", "made", "well", "get", "go", "must"), .name_repair = "minimal") |>
  rename(word = value)

# Custom bigram stopwords
stopwords_bigram <- as_tibble(c(custom_stopwords, "eu", "us", "one", "members", "perhaps", "view", "minister's", "left", "white", "never", "now", "apart"), .name_repair = "minimal") |>
  rename(word = value)

# Custom trigram stopwords
stopwords_trigram <- as_tibble(c(custom_stopwords, "no", "one", "protocol", "towards"), .name_repair = "minimal") |>
  rename(word = value)

Now we create dataframes with all the speeches tokenised in unigrams, bigrams and trigrams respectively. Then we filter them according to their frequency to have smaller dataframes.

# Unigrams
unigrams <- brexit_topics |> 
  unnest_tokens(word, speech, drop = FALSE) |> 
  anti_join(stopwords_unigram, by = "word")

unigrams <- unigrams |>
  group_by(word) |>
  filter(n() >= 5) |>
  ungroup()

# Bigrams
bigrams <- brexit_topics |> 
  unnest_tokens(bigram, speech, token = "ngrams", n = 2, drop = FALSE) |> 
  separate(bigram, into = c("word1", "word2"), sep = " ") |>   # Separate bigrams 
  filter(!word1 %in% stopwords_bigram$word & !word2 %in% stopwords_bigram$word) |> 
  unite(bigram, word1, word2, sep = " ")

bigrams <- bigrams |> 
  group_by(bigram) |> 
  filter(n() >= 4) |>   #adjust
  ungroup()

# Trigrams
trigrams <- brexit_topics |> 
  unnest_tokens(trigram, speech, token = "ngrams", n = 3, drop = FALSE) |> 
  separate(trigram, into = c("word1", "word2", "word3"), sep = " ") |>  
  filter(!word1 %in% stopwords_trigram$word & !word2 %in% stopwords_trigram$word & !word3 %in% stopwords_trigram$word) |> 
  unite(trigram, word1, word2, word3, sep = " ")

trigrams <- trigrams |> 
  group_by(trigram) |> 
  filter(n() >= 3) |>   
  ungroup()

# Term frequencies for unigrams
unigram_freq <- unigrams |> 
  count(word, sort = TRUE)

# Term frequencies for bigrams
bigram_freq <- bigrams |> 
  count(bigram, sort = TRUE)

# Term frequencies for trigrams
trigram_freq <- trigrams |> 
  count(trigram, sort = TRUE)
  
 
# Pairwise correlations for unigrams
unigram_corr <- unigrams |>
  pairwise_cor(word, speech, sort = TRUE)

# Pairwise correlations for bigrams
bigram_corr <- bigrams |> 
  pairwise_cor(bigram, speech, sort = TRUE)

# Pairwise correlations for trigrams
trigram_corr <- trigrams |>
  pairwise_cor(trigram, speech, sort = TRUE)

# We define n-grams
target_ngrams <- c("no-deal brexit", "no deal brexit", "hard brexit", "customs union", 
                   "free trade agreement", "no deal", "second referendum", "remain", 
                   "soft brexit")

# Function to get correlated words for given n-gram
get_correlated_words <- function(corr_df, ngram) {
  corr_df |>
    filter(item1 == ngram | item2 == ngram) |>
    arrange(desc(correlation))
}

# Get correlated words for each target n-gram applying the function
correlated_words <- lapply(target_ngrams, function(ngram) {
  unigram_res <- get_correlated_words(unigram_corr, ngram)
  bigram_res <- get_correlated_words(bigram_corr, ngram)
  trigram_res <- get_correlated_words(trigram_corr, ngram)
  list(unigram = unigram_res, bigram = bigram_res, trigram = trigram_res)
})

names(correlated_words) <- target_ngrams

Now we create a function to extract the top correlated n-grams:

# Function to extract top N correlated words
get_top_correlated_words <- function(corr_df, ngram, top_n = 30) {
  corr_df |>
    filter(item1 == ngram | item2 == ngram) |>
    arrange(desc(correlation)) |>
    slice_head(n = top_n)
}

# Extract top correlated words for each target n-gram
top_correlated_words <- lapply(target_ngrams, function(ngram) {
  unigram_res <- get_top_correlated_words(unigram_corr, ngram)
  bigram_res <- get_top_correlated_words(bigram_corr, ngram)
  trigram_res <- get_top_correlated_words(trigram_corr, ngram)
  
  unigram_res$ngram_type <- "unigram"
  bigram_res$ngram_type <- "bigram"
  trigram_res$ngram_type <- "trigram"
  
  bind_rows(unigram_res, bigram_res, trigram_res) |>
    mutate(target_ngram = ngram)
})

top_correlated_words_df <- bind_rows(top_correlated_words)

# Prepare data for plotting
top_correlated_words_df <- top_correlated_words_df |>
  mutate(word = ifelse(item1 == target_ngram, item2, item1)) |>
  select(target_ngram, ngram_type, word, correlation) |>
  arrange(target_ngram, ngram_type, desc(correlation))

Now plot the results:

ggplot(top_correlated_words_df, aes(x = reorder_within(word, correlation, target_ngram), y = correlation, fill = ngram_type)) +
  geom_col(show.legend = FALSE, color = NA) + 
  facet_wrap(~ target_ngram, scales = "free_y", nrow = 2, strip.position = "top") + 
  coord_flip() +
  scale_fill_brewer(palette = "Set1") + 
  labs(
    title = "Top Correlated Words for Each N-Gram",
    subtitle = "A comparison of word associations across different n-grams",
    x = "Words",
    y = "Correlation"
  ) +
  scale_x_reordered() +  # This function correctly reorders within each facet
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 14),
    axis.text.y = element_text(size = 12),
    strip.text = element_text(size = 12, face = "bold"),
    panel.grid.major = element_line(color = "gray85"),
    panel.grid.minor = element_blank()
  )

N-gram frequency by party (Brexit outcomes)

library(dplyr)
library(ggplot2)
library(ggimage)
library(cowplot)

# 1. Combine Unigrams, Bigrams, and Trigrams
# Create/rename the ngram column for consistency

bigrams <- bigrams |> rename(ngram = bigram)
trigrams <- trigrams |> rename(ngram = trigram)

# Adding a type column to identify n-gram types
unigrams$type <- "unigram"
bigrams$type <- "bigram"
trigrams$type <- "trigram"

# Combine all n-grams into a single data frame
all_ngrams <- bind_rows(unigrams, bigrams, trigrams)

# 2. Calculate Frequencies
# Count frequencies of each n-gram by party
ngram_party_freq <- all_ngrams |>
  filter(ngram %in% target_ngrams) |>
  count(party, ngram, sort = TRUE)

# Calculate total occurrences of each n-gram across all parties
total_ngram_freq <- ngram_party_freq |>
  group_by(ngram) |>
  summarise(total_n = sum(n))

# 3. Calculate Relative Frequencies
# Join the total frequencies with the party frequencies to get relative frequencies
ngram_relative_freq <- ngram_party_freq |>
  inner_join(total_ngram_freq, by = "ngram") |>
  mutate(relative_freq = n / total_n)

# 4. Merge with Party Logos
# Assuming you have your `party_logos` data frame already set up with party names and logo paths
ngram_relative_freq_with_logos <- ngram_relative_freq |>
  left_join(party_logos, by = "party")

# 5. Find the Party with the Highest Relative Frequency for Each N-Gram
top_party_for_ngram <- ngram_relative_freq_with_logos |>
  group_by(ngram) |>
  slice_max(relative_freq, with_ties = FALSE) |>
  ungroup()
# Define party colors with unused levels included
party_colors <- c(
  "labour" = "red3",
  "conservative" = "dodgerblue3",
  "scottish-national-party" = "lemonchiffon",
  "labourco-operative" = "darkmagenta",
  "dup" = "darksalmon",
  "liberal-democrat" = "goldenrod1"
)

# Convert target_ngram to title case (capitalize first letters)
top_party_for_ngram <- top_party_for_ngram |>
  mutate(ngram = str_to_title(ngram))

# Plot the results
plot <- ggplot(top_party_for_ngram, aes(x = reorder(ngram, -relative_freq), y = relative_freq, fill = party)) +
  geom_col(width = 0.7) +
  geom_image(aes(image = logo, y = relative_freq + 0.05), size = 0.11) +  # Adjust size and position as needed
  scale_fill_manual(values = party_colors, drop = TRUE) +  # drop = TRUE removes unused levels from the legend
  coord_flip() +
  labs(
    title = "Top Party by Relative Frequency of Each Target N-Gram",
    x = "N-Grams",
    y = "Relative Frequency",
    fill = "Party"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    axis.text.y = element_text(size = 12, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.title.x = element_text(size = 14),
    legend.position = "right",  # Adjust legend position as needed
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_blank()
  )

# Display final plot
print(plot)

Frequency of Brexit outcomes across time

See which party uses each most

# Convert date column to Date type
brexit_topics <- brexit_topics |>  
  mutate(date = ymd(date))
str(brexit_topics)
## 'data.frame':    3780 obs. of  4 variables:
##  $ date    : Date, format: "2013-01-21" "2013-01-21" ...
##  $ party   : chr  "conservative" "conservative" "conservative" "conservative" ...
##  $ category: chr  "EU/Brexit" "EU/Brexit" "EU/Brexit" "EU/Brexit" ...
##  $ speech  : chr  "With this it will be convenient to discuss the following: Amendment 7, page 1, line 4, leave out ‘1%’ and inser"| __truncated__ "Do the Opposition want to make it more worth while to be in work than out of work, and if so, how would they do"| __truncated__ "Is it therefore the right hon Gentleman’s and the Opposition’s policy that uprating should be not by 1%, but by"| __truncated__ "The right hon Gentleman has twice from the Dispatch Box repeated the commitment to uprate benefits by inflation"| __truncated__ ...
# Define the phrases to search for
brexit_outcomes <- c("no-deal brexit", "no deal brexit", "hard brexit", "customs union", 
                   "free trade agreement", "no deal", "second referendum", "remain", 
                   "soft brexit")

# Combine unigrams, bigrams, and trigrams into a single data frame
combined_tokens <- bind_rows(
  unigrams |> mutate(type = "unigram", outcome = unigrams$ngram),
  bigrams  |> mutate(type = "bigram", outcome = bigrams$ngram),
  trigrams |> mutate(type = "trigram", outcome = trigrams$ngram)
)

# Filter for the specific phrases and count their occurrences
phrase_counts <- combined_tokens |> 
  filter(outcome %in% brexit_outcomes) |> 
  count(date, outcome, sort = TRUE) |> 
  group_by(date, outcome) |> 
  summarize(count = sum(n)) |> 
  ungroup()



# Plot the frequency of phrases over time
phrase_counts |> 
  ggplot(aes(x = date, y = count, color = outcome, group = outcome)) +  # Added group = phrase
  geom_line() +
  labs(title = "Frequency of Brexit-related Phrases Over Time",
       x = "Date",
       y = "Frequency",
       color = "Phrase") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 70, hjust = 1))  # Rotate x-axis labels for better readability

Hot topics

Same analysis is done for the following hot topics:

# Define the new target n-grams:
new_target_ngrams <- c("single market", "customs union", "immigration", "sovereignty",
                       "expensive", "migrant crisis", "populism", "nationalism", 
                       "globalisation", "eurozone crisis")
# Get correlated words:
new_correlated_words <- lapply(new_target_ngrams, function(ngram) {
  unigram_res <- get_correlated_words(unigram_corr, ngram)
  bigram_res <- get_correlated_words(bigram_corr, ngram)
  trigram_res <- get_correlated_words(trigram_corr, ngram)
  list(unigram = unigram_res, bigram = bigram_res, trigram = trigram_res)
})

names(new_correlated_words) <- new_target_ngrams
# Get top correlated words:
new_top_correlated_words <- lapply(new_target_ngrams, function(ngram) {
  unigram_res <- get_top_correlated_words(unigram_corr, ngram)
  bigram_res <- get_top_correlated_words(bigram_corr, ngram)
  trigram_res <- get_top_correlated_words(trigram_corr, ngram)
  
  unigram_res$ngram_type <- "unigram"
  bigram_res$ngram_type <- "bigram"
  trigram_res$ngram_type <- "trigram"
  
  bind_rows(unigram_res, bigram_res, trigram_res) |> 
    mutate(target_ngram = ngram)
})

new_top_correlated_words_df <- bind_rows(new_top_correlated_words)
# Prepare and plot the results:
new_top_correlated_words_df <- new_top_correlated_words_df |>
  mutate(word = ifelse(item1 == target_ngram, item2, item1)) |>
  select(target_ngram, ngram_type, word, correlation) |>
  arrange(target_ngram, ngram_type, desc(correlation))

ggplot(new_top_correlated_words_df, aes(x = reorder_within(word, correlation, target_ngram), 
                                        y = correlation, fill = ngram_type)) +
  geom_col(show.legend = FALSE, color = NA) + 
  facet_wrap(~ target_ngram, scales = "free_y", nrow = 2, strip.position = "top") + 
  coord_flip() +
  scale_fill_brewer(palette = "Set1") + 
  labs(
    title = "Top Correlated Words for New N-Grams",
    subtitle = "A comparison of word associations across different n-grams",
    x = "Words",
    y = "Correlation"
  ) +
  scale_x_reordered() +  
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 14),
    axis.text.y = element_text(size = 12),
    strip.text = element_text(size = 12, face = "bold"),
    panel.grid.major = element_line(color = "gray85"),
    panel.grid.minor = element_blank()
  )

Frequency of Hot topics across time

See which party uses each hot topic most:

# Convert date column to Date type
brexit_topics <- brexit_topics |>  
  mutate(date = ymd(date))
str(brexit_topics)
## 'data.frame':    3780 obs. of  4 variables:
##  $ date    : Date, format: "2013-01-21" "2013-01-21" ...
##  $ party   : chr  "conservative" "conservative" "conservative" "conservative" ...
##  $ category: chr  "EU/Brexit" "EU/Brexit" "EU/Brexit" "EU/Brexit" ...
##  $ speech  : chr  "With this it will be convenient to discuss the following: Amendment 7, page 1, line 4, leave out ‘1%’ and inser"| __truncated__ "Do the Opposition want to make it more worth while to be in work than out of work, and if so, how would they do"| __truncated__ "Is it therefore the right hon Gentleman’s and the Opposition’s policy that uprating should be not by 1%, but by"| __truncated__ "The right hon Gentleman has twice from the Dispatch Box repeated the commitment to uprate benefits by inflation"| __truncated__ ...
# Define the phrases to search for
phrases <- c("single market", "customs union", "immigration",
             "sovereignty","expensive", "migrant crisis",
             "populism", "nationalism", "globalisation", "eurozone crisis")

# Combine unigrams, bigrams, and trigrams into a single data frame
combined_tokens <- bind_rows(
  unigrams |> mutate(type = "unigram", outcome = unigrams$ngram),
  bigrams  |> mutate(type = "bigram", outcome = bigrams$ngram),
  trigrams |> mutate(type = "trigram", outcome = trigrams$ngram)
)

# Filter for the specific phrases and count their occurrences
phrase_counts <- combined_tokens |> 
  filter(outcome %in% phrases) |> 
  count(date, outcome, sort = TRUE) |> 
  group_by(date, outcome) |> 
  summarize(count = sum(n)) |> 
  ungroup()


# Plot the frequency of phrases over time
phrase_counts |> 
  ggplot(aes(x = date, y = count, color = outcome, group = outcome)) +  # Added group = phrase
  geom_line() +
  labs(title = "Frequency of Brexit-related Phrases Over Time",
       x = "Date",
       y = "Frequency",
       color = "Phrase") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 70, hjust = 1))  # Rotate x-axis labels for better readability

Hate speech with python

Regarding the model used for detecting Hate speech, the unhcr/hatespeech-detection model turned out to be the best option as the RoBERTa version was also tested and the first model yielded better results. The RoBERTa model did not classify any speech as hate or offensive speech, so this proves that it was not adequate at all. than the unhcr model.

The workflow for running the model is the following

| eval = False
!pip install --upgrade pip
!pip install python-dotenv
from dotenv import load_dotenv
from huggingface_hub import login

# Load environment variables from the .env file
load_dotenv()

# Retrieve the Hugging Face API token from the environment variable
token = os.getenv('HUGGINGFACE_TOKEN')
token = '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'
# Login to Hugging Face using your API token
login(token=token)

Following this the base dataset will be called:

| eval = False
import pandas as pd
import numpy as np
import os
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.nn.functional import softmax
import torch
from dotenv import load_dotenv
from huggingface_hub import login
import time

# Disable parallelism warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Load the data
df = pd.read_csv("/Users/lauramartinez/Desktop/Master/TFM/TFM_final/python_df.csv")

And finally the model will be ran thorugh the data by:

| eval = False
# Load the model and tokenizer
model_name = "unhcr/hatespeech-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Ensure model is using GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Following this:

| eval = False
# Define label mapping
id2label = {0: "Normal", 1: "Offensive", 2: "Hate speech"}

# Preprocessing function
def preprocess(text):
    return text.strip()

# Step 4a: Define a function to get the hate speech label and score for a given text
def get_hate_speech_score(text):
    text = preprocess(text)
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512).to(device)
    with torch.no_grad():
        output = model(**encoded_input)
    scores = output.logits[0].cpu().numpy()
    scores = softmax(torch.tensor(scores), dim=0).numpy()
    ranking = np.argsort(scores)[::-1]
    label = id2label[ranking[0]]  
    score = scores[ranking[0]]
    return label, float(score)

# Step 4b: Define a function to handle long texts by splitting into chunks
def get_hate_speech_score_for_long_text(text):
    text = preprocess(text)
    tokens = tokenizer.tokenize(text)
    chunk_size = 512
    num_chunks = (len(tokens) + chunk_size - 1) // chunk_size

    all_scores = np.zeros(len(id2label))

    for i in range(num_chunks):
        chunk_tokens = tokens[i * chunk_size:(i + 1) * chunk_size]
        chunk_text = tokenizer.convert_tokens_to_string(chunk_tokens)
        encoded_input = tokenizer(chunk_text, return_tensors='pt', truncation=True, max_length=chunk_size).to(device)
        with torch.no_grad():
            output = model(**encoded_input)
        scores = output.logits[0].cpu().numpy()
        scores = softmax(torch.tensor(scores), dim=0).numpy()
        all_scores += scores

    all_scores /= num_chunks
    ranking = np.argsort(all_scores)[::-1]
    label = id2label[ranking[0]]
    score = all_scores[ranking[0]]
    return label, float(score)

# Tokenize all texts first to check their lengths
token_lengths = df['speech'].apply(lambda x: len(tokenizer.tokenize(preprocess(x))))

# Apply the function to each row in the "speech" column
labels = []
scores = []

start_time = time.time()

for idx, (speech, length) in enumerate(zip(df['speech'], token_lengths)):
    if length > 512:
        label, score = get_hate_speech_score_for_long_text(speech)
    else:
        label, score = get_hate_speech_score(speech)
    labels.append(label)
    scores.append(score)
    
    # Log progress every 100 rows
    if (idx + 1) % 100 == 0:
        elapsed_time = time.time() - start_time
        print(f"Processed {idx + 1}/{len(df)} rows. Elapsed time: {elapsed_time:.2f} seconds.")

# Add the results to new columns
df['label'] = labels
df['score'] = scores

import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distribution of labels
plt.figure(figsize=(8, 6))
sns.countplot(x=df['label'])
plt.title('Label Distribution')
plt.xlabel('Label')
plt.ylabel('Count')

# Save the plot with a different name
plt.savefig("labeled_speech.png", dpi=300, bbox_inches='tight')  # Save as PNG with high resolution

plt.show()


# Save the updated DataFrame to a new CSV file
df.to_csv("/Users/lauramartinez/Desktop/Master/TFM/TFM_final/labeled_speech_df.csv", index=False)
#check if NAs have been created
labeled_speech_df <- read_csv("labeled_speech_df.csv")
str(labeled_speech_df)
na_indices <- is.na(labeled_speech_df)
labeled_speech_df[na_indices]

Now we can start in R to perform inflammatory speech analyses:

labeled_speech_df <- read_csv("labeled_speech_df.csv")
# Distribution of labels check

# Count the occurrences of each label
label_counts <- labeled_speech_df |>
  group_by(label) |>
  summarize(count = n()) |>
  arrange(desc(count))

# Print the counts
print(label_counts)
## # A tibble: 3 × 2
##   label       count
##   <chr>       <int>
## 1 Normal      13032
## 2 Offensive      36
## 3 Hate speech     9

Which party uses Hate speech the most?

To examine this a function to get the top 5 parties that use offensive or hate speech the most is created with a relative frequency

# Load necessary libraries
library(dplyr)
library(ggplot2)

# Function to get top 5 parties by label with relative frequency
get_top_parties_relative <- function(df, speech_label) {
  df |>
    filter(label == speech_label) |>
    group_by(party) |>
    summarise(count = n()) |>
    mutate(relative_frequency = count / sum(count)) |>
    arrange(desc(count)) |>
    slice_head(n = 5) # Get top 5
}

# Get top 5 parties for Hate Speech
top_hate_speech_parties <- get_top_parties_relative(labeled_speech_df, "Hate speech")
top_hate_speech_parties
## # A tibble: 4 × 3
##   party                   count relative_frequency
##   <chr>                   <int>              <dbl>
## 1 conservative                6              0.667
## 2 labour                      1              0.111
## 3 labourco-operative          1              0.111
## 4 scottish-national-party     1              0.111
# Get top 5 parties for Offensive Speech
top_offensive_speech_parties <- get_top_parties_relative(labeled_speech_df, "Offensive")
top_offensive_speech_parties
## # A tibble: 5 × 3
##   party                   count relative_frequency
##   <chr>                   <int>              <dbl>
## 1 labour                     17             0.472 
## 2 conservative                7             0.194 
## 3 scottish-national-party     5             0.139 
## 4 labourco-operative          3             0.0833
## 5 liberal-democrat            2             0.0556

Then let´s visualize the results with a Dumbbell plot:

# Prepare data for Dumbbell Plot
combined_data <- bind_rows(
  top_hate_speech_parties |> mutate(type = "Hate Speech"),
  top_offensive_speech_parties |> mutate(type = "Offensive")
)

# Dumbbell plot
dumbbell_plot <- ggplot(combined_data, aes(x = relative_frequency, y = reorder(party, -relative_frequency), color = type)) +
  geom_point(size = 7, shape = 21, fill = "white", stroke = 2) +  # Bigger dots with white fill and black outline
  geom_line(aes(group = party), color = "grey50", linewidth = 1.2) +  # Use linewidth instead of size for lines
  scale_color_manual(values = c("Hate Speech" = "red3", "Offensive" = "dodgerblue3")) +
  geom_text(aes(label = round(relative_frequency, 2)), vjust = -1.5, size = 4.5, color = "black") +  # Position labels above dots
  labs(title = "Comparison of Hate Speech and Offensive Speech by Party",
       subtitle = "Top 5 parties with the highest relative frequency",
       x = "Relative Frequency", y = "Party") +
  theme_minimal(base_size = 15) +  # Increase base font size for readability
  theme(
    plot.title = element_text(face = "bold", size = 20, hjust = 0.5),  # Bold, centered title
    plot.subtitle = element_text(size = 15, hjust = 0.5),  # Centered subtitle
    axis.text.y = element_text(face = "bold"),  # Bold y-axis labels
    axis.title.x = element_text(margin = margin(t = 10)),  # Add margin to x-axis title
    legend.position = "top",  # Move legend to the top
    legend.key = element_blank(),  # Remove default shapes in legend
    legend.text = element_text(size = 12),
    legend.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey80"),  # Add major grid lines
    panel.grid.minor = element_blank()  # Remove minor grid lines
  ) +
  guides(color = guide_legend(title = "Type of Speech", override.aes = list(shape = 21, size = 6)))  # Customize legend keys

print(dumbbell_plot)

Is it something demographic?

To see if it is something demographic we need to plot the use of hate or offensive speech by constituency to see if any patterns can be inferred. For this the shapefile of the UK constituencies should be downloaded although it is also available on Github:

# Ensure constituency names match between dataframes (consider using string manipulation to match them)
labeled_speech_df <- labeled_speech_df |>
  mutate(speaker_constituency = tolower(speaker_constituency))

# Load the shapefile
library(sf)
shapefile_data <- st_read("uk_constituencies_2016.shp")
## Reading layer `uk_constituencies_2016' from data source 
##   `/Users/lauramartinez/Desktop/Master/TFM/TFM_final/uk_constituencies_2016.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 398 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 189.8508 ymin: 7440 xmax: 655619 ymax: 1218625
## Projected CRS: OSGB36 / British National Grid
shapefile_data <- shapefile_data |>
  mutate(area_name = tolower(area_name))

# Merge the datasets
joined_data <- labeled_speech_df |>
  inner_join(shapefile_data, by = c("speaker_constituency" = "area_name"))

# Create a new label
joined_data <- joined_data |>
  mutate(label_combined = case_when(
    label == "Hate speech" & label == "Offensive" ~ "Both",  # This condition needs to be logically correct
    label == "Hate speech" ~ "Hate speech",
    label == "Offensive" ~ "Offensive",
    TRUE ~ NA_character_
  ))

Once the data is merged, the map with the results can be plotted:

# Custom color mapping
speech_colors <- c("Hate speech" = "darkred", 
                   "Offensive" = "gold", 
                   "Both" = "orange")  # Add a color for both categories

# Plotting the map with filled constituencies for Hate speech and Offensive speech
ggplot() +
  geom_sf(data = shapefile_data, fill = "white", color = "black") +  # Map outline
  geom_sf(data = joined_data |> filter(!is.na(label_combined)), 
          aes(geometry = geometry, fill = label_combined), 
          color = "black", size = 0.3, alpha = 0.7) +  # Filled constituencies with black borders
  scale_fill_manual(values = speech_colors) +  # Custom fill colors
  labs(title = "Constituencies with Hate and Offensive Speech",
       fill = "Speech Type") +
  theme_minimal() +
  theme(
    legend.position = "bottom",  # Move legend to bottom
    legend.title = element_text(size = 12),  # Adjust legend title size
    legend.text = element_text(size = 8),  # Adjust legend text size
    legend.key.size = unit(0.5, "cm"),  # Adjust legend key size
    legend.spacing.x = unit(0.5, "cm"),  # Adjust spacing between legend items
    panel.grid.major = element_blank(),  # Remove major gridlines
    panel.grid.minor = element_blank(),  # Remove minor gridlines
    axis.title.x = element_blank(),  # Remove x-axis title
    axis.title.y = element_blank(),  # Remove y-axis title
    axis.text.x = element_blank(),   # Remove x-axis text
    axis.text.y = element_blank(),   # Remove y-axis text
    axis.ticks = element_blank(),    # Remove axis ticks
    plot.margin = margin(10, 10, 10, 10),  # Reduce margin around the plot
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold")  # Center title, adjust size
  ) +
  coord_sf(expand = FALSE)  # Prevents unnecessary space around the map

If we wanted to see the plotted constituency names and their speech labels:

# Filter for rows where the label is either "Hate speech" or "Offensive"
constituencies_with_labels <- joined_data |>
    select(speaker_constituency, label) |>
    filter(label == "Hate speech" | label == "Offensive")

# Display the result
print(constituencies_with_labels)
## # A tibble: 12 × 2
##    speaker_constituency label      
##    <chr>                <chr>      
##  1 cheltenham           Offensive  
##  2 gloucester           Hate speech
##  3 lincoln              Hate speech
##  4 midlothian           Offensive  
##  5 gloucester           Offensive  
##  6 middlesbrough        Offensive  
##  7 daventry             Hate speech
##  8 st albans            Offensive  
##  9 north somerset       Offensive  
## 10 north tyneside       Offensive  
## 11 mid sussex           Offensive  
## 12 north antrim         Offensive

Frequency of hate speech through time

To delve deeper on whether there are temporal patterns in the use of inflammatory language in the House of Commons also the use of this type of language through time to examine any trends:

  • To do so occurrences must be first grouped and counted by date

  • Then the timeline will be plotted

# Summarize the count of each label by month
speech_summary <- labeled_speech_df |> 
  group_by(date, label) |> 
  summarize(count = n()) |> 
  ungroup()

# Ensure the date column is in Date format
speech_summary$date <- as.Date(speech_summary$date)

# Filter the data to include only the "Hate speech" and "Offensive" labels
filtered_speech_summary <- speech_summary |>
  filter(label %in% c("Hate speech", "Offensive"))

monthly_summary <- filtered_speech_summary |>
  mutate(month = floor_date(date, "month")) |>
  group_by(month, label) |>
  summarize(count = sum(count)) |>
  ungroup()

# Plot results with a bubble chart
ggplot(monthly_summary, aes(x = month, y = label, size = count, color = label)) +
  geom_point(alpha = 0.7) +
  labs(title = "Frequency of Hate Speech and Offensive Speech Over Time",
       x = "Month",
       y = "Label",
       size = "Count",
       color = "Label") +
  theme_minimal() +
  scale_x_date(date_labels = "%Y-%m", date_breaks = "2 months") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Most conflictive category

Finally we might be interested in seeing which topic triggered inflammatory language the most, and to investigate this the speech category, the speech itself and its label will be filtered to keep only the ones that contain inflammatory language:

hate_offensive_categories <- labeled_speech_df |>
  filter(label %in% c("Hate speech", "Offensive")) |>
  select(category, label, speech) |> 
  print(n=45)
## # A tibble: 45 × 3
##    category                        label       speech                           
##    <chr>                           <chr>       <chr>                            
##  1 EU/Brexit                       Offensive   The right hon Gentleman may or m…
##  2 Social Security/Society/Welfare Hate speech Can my right hon Friend think of…
##  3 Social Security/Society/Welfare Hate speech Does the hon Gentleman know of a…
##  4 Social Security/Society/Welfare Hate speech My right hon Friend is being ver…
##  5 EU/Brexit                       Offensive   I would like to respond to the p…
##  6 EU/Brexit                       Hate speech Does my hon Friend agree that, d…
##  7 Employment                      Offensive   Is my hon Friend aware that when…
##  8 Housing                         Offensive   The right hon Gentleman has made…
##  9 Employment                      Hate speech Muslims around the world have ma…
## 10 EU/Brexit                       Offensive   Amendment 2 proposes leaving out…
## 11 EU/Brexit                       Offensive   Bin it!Is not this shambles of a…
## 12 EU/Brexit                       Hate speech I think that the hon Lady is str…
## 13 EU/Brexit                       Offensive   On a point of order, Mr Speaker.…
## 14 Parliamentary proceedings       Offensive   Absolute rubbish.                
## 15 Education                       Offensive   Does my right hon Friend agree t…
## 16 EU/Brexit                       Offensive   A disgrace!                      
## 17 Employment                      Offensive   Not from those nasty Tories.     
## 18 Economy/Technology              Offensive   We are getting to the crux of th…
## 19 Social Security/Society/Welfare Offensive   Does the right hon Gentleman agr…
## 20 Social Security/Society/Welfare Offensive   I am very grateful to the Libera…
## 21 Social Security/Society/Welfare Offensive   Speaking as one of the forces of…
## 22 Social Security/Society/Welfare Offensive   Exactly how much of that money w…
## 23 Energy/Environment              Offensive   May I put on record my condolenc…
## 24 Social Security/Society/Welfare Hate speech I join others in highlighting th…
## 25 Employment                      Offensive   The Chancellor has taken us thro…
## 26 EU/Brexit                       Hate speech I want to make an appeal to the …
## 27 Social Security/Society/Welfare Offensive   Does not new clause 13 expose th…
## 28 Social Security/Society/Welfare Offensive   It is a long-standing custom in …
## 29 Transport                       Offensive   Thank you very much, Mr Speaker.…
## 30 EU/Brexit                       Offensive   I am a new Member here, and most…
## 31 Housing                         Offensive   Does my hon Friend agree that an…
## 32 Housing                         Hate speech The right hon Gentleman might be…
## 33 EU/Brexit                       Offensive   I will try again. Am I now to as…
## 34 EU/Brexit                       Offensive   The right hon Gentleman is of co…
## 35 EU/Brexit                       Offensive   I agree with my hon Friend that …
## 36 EU/Brexit                       Offensive   My right hon Friend is making an…
## 37 EU/Brexit                       Offensive   The right hon and learned Gentle…
## 38 EU/Brexit                       Offensive   He is very irritating.           
## 39 EU/Brexit                       Offensive   You can’t be that stupid!        
## 40 EU/Brexit                       Offensive   I agree with the hon Gentleman’s…
## 41 EU/Brexit                       Offensive   Shocking.                        
## 42 EU/Brexit                       Offensive   My hon Friend is making an excel…
## 43 EU/Brexit                       Offensive   Does my right hon Friend recall …
## 44 Parliamentary proceedings       Offensive   I believe the right hon Gentlema…
## 45 EU/Brexit                       Offensive   My hon Friend says that Parliame…

Then the relative frequency of use of this type of language will be calculated, the top 3 topics or categories identified and finally the results plotted with a bar plot:

# Step 1: Calculate the frequency and relative frequency for each category
category_freq <- hate_offensive_categories |>
  group_by(category) |>
  summarise(count = n()) |>
  mutate(relative_frequency = count / sum(count)) |>
  arrange(desc(relative_frequency))

# Step 2: Identify the top 3 categories
top_categories <- category_freq |>
  slice_max(relative_frequency, n = 3)

# Step 3: Define a custom color palette with stronger colors
strong_colors <- c("brown3", "blue4", "gold2")  # Vibrant colors

# Step 4: Plot the top 3 categories
ggplot(top_categories, aes(x = reorder(category, -relative_frequency), y = relative_frequency, fill = category)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = scales::percent(relative_frequency, accuracy = 0.1)), 
            vjust = -0.5, size = 5) +
  labs(title = "Top 3 Categories by Relative Frequency of Hate and Offensive Speech",
       x = "Category", y = "Relative Frequency") +
  theme_minimal() +
  scale_fill_manual(values = strong_colors) +  # Apply custom strong colors
  theme(legend.position = "none",  # Hide the legend
        plot.title = element_text(size = 16, face = "bold"),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 14))

References:

BBC News. (2016, June 24). EU referendum results. BBC News. https://www.bbc.com/news/politics/eu_referendum/results

Cameron, D. (2013, January 23). EU speech at Bloomberg. GOV.UK. https://www.gov.uk/government/speeches/eu-speech-at-bloomberg

House of Commons. (n.d.). Members of Parliament after 1832. https://membersafter1832.historyofparliamentonline.org/members

Osnabrügge, M., Hobolt, S. B., & Rodon, T. (2021). Emotive rhetoric in legislative debates. Journal of Political Science, 57(1), 45-62.

Smith, J. (2017). Parliamentary debates and the rhetoric of Brexit: An analysis. Journal of British Politics, 12(2), 56-78.